Orchestrate jobs by running dsub pipelines on Batch


This tutorial explains how to run a dsub pipeline on Batch. Specifically, the example dsub pipeline processes DNA sequencing data in a Binary Alignment Map (BAM) file to create a BAM index (BAI) file.

This tutorial is intended for Batch users who want to use dsub with Batch. dsub is an open source job scheduler for orchestrating batch-processing workflows on Google Cloud. To learn more about how to use Batch with dsub, see the dsub documentation for Batch.

Objectives

  • Run a dsub pipeline on Batch that reads and writes files in Cloud Storage buckets.
  • View the output files in a Cloud Storage bucket.

Costs

In this document, you use the following billable components of Google Cloud:

  • Batch
  • Cloud Storage

To generate a cost estimate based on your projected usage, use the pricing calculator. New Google Cloud users might be eligible for a free trial.

The resources created in this tutorial typically cost less than a dollar, assuming you complete all the steps—including the cleanup—in a timely manner.

Before you begin

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. Install the Google Cloud CLI.
  3. To initialize the gcloud CLI, run the following command:

    gcloud init
  4. Create or select a Google Cloud project.

    • Create a Google Cloud project:

      gcloud projects create PROJECT_ID

      Replace PROJECT_ID with a name for the Google Cloud project you are creating.

    • Select the Google Cloud project that you created:

      gcloud config set project PROJECT_ID

      Replace PROJECT_ID with your Google Cloud project name.

  5. Make sure that billing is enabled for your Google Cloud project.

  6. Enable the Batch, Cloud Storage, Compute Engine, and Logging APIs:

    gcloud services enable batch.googleapis.com compute.googleapis.com logging.googleapis.com storage.googleapis.com
  7. Make sure that your project has at least one service account with the permissions required for this tutorial.

    Each job requires a service account that allows the Batch service agent to create and access the resources required to run the job. For this tutorial, the job's service account is the Compute Engine default service account.

    To ensure that the Compute Engine default service account has the necessary permissions to allow the Batch service agent to create and access resources for Batch jobs, ask your administrator to grant the Compute Engine default service account the required IAM roles.

    For more information about granting roles, see Manage access to projects, folders, and organizations.

    Your administrator might also be able to give the Compute Engine default service account the required permissions through custom roles or other predefined roles.
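
    For example, an administrator can grant a role to the Compute Engine default service account by running the gcloud projects add-iam-policy-binding command. The following command is only a sketch: it uses the Batch Agent Reporter role (roles/batch.agentReporter) as an example, so confirm the exact roles that your jobs need before granting anything:

      # Example only: grants the Batch Agent Reporter role to the Compute Engine
      # default service account. Adjust the role (or roles) to match your needs.
      gcloud projects add-iam-policy-binding PROJECT_ID \
          --member serviceAccount:PROJECT_NUMBER-compute@developer.gserviceaccount.com \
          --role roles/batch.agentReporter

    Replace PROJECT_ID with your project ID and PROJECT_NUMBER with your project's number.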

  8. Make sure that you have the permissions required for this tutorial.

    To get the permissions that you need to complete this tutorial, ask your administrator to grant you the required IAM roles on the project.
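
    For example, an administrator can grant a role to your user account by running the gcloud projects add-iam-policy-binding command. The following command is only a sketch: it uses the Batch Job Editor role (roles/batch.jobsEditor) as an example, because the full set of roles required for this tutorial isn't listed here:

      # Example only: grants the Batch Job Editor role to your user account.
      gcloud projects add-iam-policy-binding PROJECT_ID \
          --member user:USER_EMAIL \
          --role roles/batch.jobsEditor

    Replace PROJECT_ID with your project ID and USER_EMAIL with the email address of your Google Cloud user account.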

  9. Install dsub and its dependencies. For more information, see the dsub installation documentation.

    1. Make sure you have installed versions of Python and pip that are supported by the latest version of dsub. To view the currently installed versions, run the following command:

      pip --version
      

      If you need to install or update pip or Python, follow the steps for installing Python.

    2. Recommended: To prevent dependency-conflict errors when installing dsub, create and activate a Python virtual environment:

      python -m venv dsub_libs && source dsub_libs/bin/activate
      
    3. Clone the dsub GitHub repository using git, and then change into the repository's directory:

      git clone https://github.com/databiosphere/dsub.git && cd dsub
      
    4. Install dsub and its dependencies:

      python -m pip install .
      

      The output is similar to the following:

      ...
      Successfully installed cachetools-5.3.1 certifi-2023.7.22 charset-normalizer-3.3.1 dsub-0.4.9 funcsigs-1.0.2 google-api-core-2.11.0 google-api-python-client-2.85.0 google-auth-2.17.3 google-auth-httplib2-0.1.0 google-cloud-batch-0.10.0 googleapis-common-protos-1.61.0 grpcio-1.59.0 grpcio-status-1.59.0 httplib2-0.22.0 idna-3.4 mock-4.0.3 parameterized-0.8.1 proto-plus-1.22.3 protobuf-4.24.4 pyasn1-0.4.8 pyasn1-modules-0.2.8 pyparsing-3.1.1 python-dateutil-2.8.2 pytz-2023.3 pyyaml-6.0 requests-2.31.0 rsa-4.9 six-1.16.0 tabulate-0.9.0 tenacity-8.2.2 uritemplate-4.1.1 urllib3-2.0.7
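
    5. Optional: Verify that dsub is installed and on your PATH. For example, if your installed dsub release supports the --version flag, the following command prints the version; otherwise, run dsub --help instead:

      # Prints the installed dsub version (assumes the --version flag is supported).
      dsub --version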
      

Create a Cloud Storage bucket

To create a Cloud Storage bucket for storing the output files from the sample dsub pipeline, use the gcloud CLI to run the gcloud storage buckets create command:

gcloud storage buckets create gs://BUCKET_NAME \
    --project PROJECT_ID

Replace the following:

  • BUCKET_NAME: a globally unique name for the bucket.

  • PROJECT_ID: the project ID of your Google Cloud project.

The output is similar to the following:

Creating gs://BUCKET_NAME/...

Run the dsub pipeline

The sample dsub pipeline indexes a BAM file from the 1,000 Genomes Project and outputs the results to a Cloud Storage bucket.

To run the sample dsub pipeline, run the following dsub command:

dsub \
    --provider google-batch \
    --project PROJECT_ID \
    --logging gs://BUCKET_NAME/WORK_DIRECTORY/logs \
    --input BAM=gs://genomics-public-data/1000-genomes/bam/HG00114.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam \
    --output BAI=gs://BUCKET_NAME/WORK_DIRECTORY/HG00114.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam.bai \
    --image quay.io/cancercollaboratory/dockstore-tool-samtools-index \
    --command 'samtools index ${BAM} ${BAI}' \
    --wait

Replace the following:

  • PROJECT_ID: the project ID of your Google Cloud project.

  • BUCKET_NAME: the name of the Cloud Storage bucket that you created.

  • WORK_DIRECTORY: the name for a new directory that the pipeline can use to store logs and outputs. For example, enter workDir.

The dsub pipeline runs a Batch job that writes the BAI file and logs to the WORK_DIRECTORY directory in your Cloud Storage bucket. Specifically, the job uses a prebuilt Docker image, specified by the --image flag, that runs samtools to index the BAM file that you specified in the --input flag.
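
If you want to try the containerized indexing step outside of Batch, you can run the same image locally. The following command is only an optional sketch: it assumes that Docker is installed, that you have downloaded the BAM file to your current directory, and that the image's default entrypoint doesn't override the command:

# Optional, local-only sketch: assumes Docker, a local copy of the BAM file,
# and no conflicting entrypoint in the image.
docker run --rm -v "$(pwd)":/data \
    quay.io/cancercollaboratory/dockstore-tool-samtools-index \
    samtools index /data/HG00114.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam \
    /data/HG00114.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam.bai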

The command doesn't finish until the dsub pipeline finishes running, and the total time varies based on when the Batch job is scheduled. Usually, the command takes about 10 minutes: Batch typically starts running the job within a few minutes, and the job itself runs for about 8 minutes.

While the command is running, the output is similar to the following:

Job properties:
  job-id: JOB_NAME
  job-name: samtools
  user-id: USERNAME
Provider internal-id (operation): projects/PROJECT_ID/locations/us-central1/jobs/JOB_NAME
Launched job-id: JOB_NAME
To check the status, run:
  dstat --provider google-batch --project PROJECT_ID --location us-central1 --jobs 'JOB_NAME' --users 'USERNAME' --status '*'
To cancel the job, run:
  ddel --provider google-batch --project PROJECT_ID --location us-central1 --jobs 'JOB_NAME' --users 'USERNAME'
Waiting for job to complete...
Waiting for: JOB_NAME.

Then, after the job has successfully finished, the command ends and the output is similar to the following:

  JOB_NAME: SUCCESS
JOB_NAME

This output includes the following values:

  • JOB_NAME: the name of the job.

  • USERNAME: your Google Cloud username.

  • PROJECT_ID: the project ID of your Google Cloud project.
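
Because the google-batch provider runs the pipeline as a Batch job, you can also inspect the job directly by using the gcloud CLI. For example, the following command describes the job in the us-central1 location, which matches the job name shown in the preceding output:

gcloud batch jobs describe JOB_NAME \
    --location us-central1 \
    --project PROJECT_ID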

View the output files

To view the output files created by the sample dsub pipeline, use the gcloud CLI to run the gcloud storage ls command:

gcloud storage ls gs://BUCKET_NAME/WORK_DIRECTORY \
    --project PROJECT_ID

Replace the following:

  • BUCKET_NAME: the name of the Cloud Storage bucket that you created.

  • WORK_DIRECTORY: the directory that you specified in the dsub command.

  • PROJECT_ID: the project ID of your Google Cloud project.

The output is similar to the following:

gs://BUCKET_NAME/WORK_DIRECTORY/HG00114.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam.bai
gs://BUCKET_NAME/WORK_DIRECTORY/logs/

This output includes the BAI file and a directory containing the job's logs.
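
Optionally, you can also inspect the job's log files. For example, the following commands list the log files in the logs directory and then print the contents of one of them, where LOG_FILE_PATH is a placeholder for one of the paths that the first command returns:

gcloud storage ls gs://BUCKET_NAME/WORK_DIRECTORY/logs \
    --project PROJECT_ID

gcloud storage cat LOG_FILE_PATH \
    --project PROJECT_ID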

Clean up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.

Delete the project

The easiest way to eliminate billing is to delete the current project.

Delete a Google Cloud project:

gcloud projects delete PROJECT_ID

Delete individual resources

If you want to keep using the current project, then delete the individual resources used in this tutorial.

Delete the bucket

When the pipeline finishes running, its output files are stored in the WORK_DIRECTORY directory of your Cloud Storage bucket.

To reduce Cloud Storage charges to the current Google Cloud account, do one of the following:

  • If you no longer need the bucket you used in this tutorial, then use the gcloud storage rm command with the --recursive flag to delete the bucket and all of its contents:

    gcloud storage rm gs://BUCKET_NAME \
        --recursive \
        --project PROJECT_ID
    

    Replace the following:

    • BUCKET_NAME: the name of the Cloud Storage bucket that you created.

    • PROJECT_ID: the project ID of your Google Cloud project.

  • Otherwise, if you still need the bucket, then use the gcloud storage rm command with the --recursive flag to delete only the WORK_DIRECTORY directory and all of its contents:

    gcloud storage rm gs://BUCKET_NAME/WORK_DIRECTORY \
        --recursive \
        --project PROJECT_ID
    

    Replace the following:

    • BUCKET_NAME: the name of the Cloud Storage bucket that you created.

    • WORK_DIRECTORY: the directory that you specified in the dsub command.

    • PROJECT_ID: the project ID of your Google Cloud project.

Delete the job

To delete a job using the gcloud CLI, run the gcloud batch jobs delete command:

gcloud batch jobs delete JOB_NAME \
    --location us-central1 \
    --project PROJECT_ID

Replace the following:

  • JOB_NAME: the name of the job.
  • PROJECT_ID: the project ID of your Google Cloud project.

What's next