Orchestrate jobs by running Nextflow pipelines on Batch

This tutorial explains how to run a Nextflow pipeline on Batch. Specifically, this tutorial runs the sample rnaseq-nf life sciences pipeline from Nextflow, which quantifies genomic features from short read data using RNA-Seq.

This tutorial is intended for Batch users who want to use Nextflow with Batch.

Nextflow is open-source software for orchestrating bioinformatics workflows.

Objectives

By completing this tutorial, you'll learn how to do the following:

Install Nextflow in Cloud Shell.
Create a Cloud Storage bucket.
Configure a Nextflow pipeline.
Run a sample pipeline using Nextflow on Batch.
View outputs of the pipeline.
Clean up to avoid incurring additional charges by doing one of the following:
- Delete a project.
- Delete individual resources.

Costs

In this document, you use the following billable components of Google Cloud:

Batch
Cloud Storage

To generate a cost estimate based on your projected usage, use the pricing calculator. New Google Cloud users might be eligible for a free trial.

The resources created in this tutorial typically cost less than a dollar, assuming you complete all the steps—including the cleanup—in a timely manner.

Before you begin

Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.

Install the Google Cloud CLI.

To initialize the gcloud CLI, run the following command:

gcloud init

Create or select a Google Cloud project.

Create a Google Cloud project:
```
gcloud projects create PROJECT_ID
```
Replace PROJECT_ID with a name for the Google Cloud project you are creating.
Select the Google Cloud project that you created:
```
gcloud config set project PROJECT_ID
```
Replace PROJECT_ID with your Google Cloud project name.

Make sure that billing is enabled for your Google Cloud project.

Enable the Batch, Cloud Storage, Compute Engine, and Logging APIs:

gcloud services enable batch.googleapis.com compute.googleapis.com logging.googleapis.com storage.googleapis.com

Install the Google Cloud CLI.

To initialize the gcloud CLI, run the following command:

gcloud init

Create or select a Google Cloud project.

Create a Google Cloud project:
```
gcloud projects create PROJECT_ID
```
Replace PROJECT_ID with a name for the Google Cloud project you are creating.
Select the Google Cloud project that you created:
```
gcloud config set project PROJECT_ID
```
Replace PROJECT_ID with your Google Cloud project name.

Make sure that billing is enabled for your Google Cloud project.

Enable the Batch, Cloud Storage, Compute Engine, and Logging APIs:

gcloud services enable batch.googleapis.com compute.googleapis.com logging.googleapis.com storage.googleapis.com

Make sure that your project has a Virtual Private Cloud (VPC) network with a valid networking configuration for this tutorial.

Note: No action is required for this step unless your organization disables default network creation by enabling the compute.skipDefaultNetworkCreation policy constraint or your project has modified its default network.

This tutorial assumes that you are using the default network. By default, Google Cloud resources use the default network, which provides the network access required for this tutorial.

Objective: Optionally, if you want a Batch job for a Nextflow pipeline to use a different network, you can modify the nextflow.config file in this tutorial to also define the job's network and subnet by using the google.batch.network and google.batch.subnetwork Nextflow fields, respectively. However, you must also make sure that the network and subnet are correctly configured for your workload. For more information, see Batch networking overview.

Make sure that your project has at least one service account with the permissions required for running the Batch job in this tutorial.

By default, jobs use the Compute Engine default service account, which is automatically granted the Editor (roles/editor) IAM role and already has all the permissions required for this tutorial.

Note: No action is required for this step unless your organization disables automatic permissions for default service accounts by enabling the iam.automaticIamGrantsForDefaultServiceAccounts policy constraint, your project has modified its Compute Engine default service account, or you want to use a different service account.

To ensure that the job's service account has the necessary permissions to allow the Batch service agent to create and access resources for Batch jobs, ask your administrator to grant the job's service account the following IAM roles:
- Batch Agent Reporter (roles/batch.agentReporter) on the project
- Storage Admin (roles/storage.admin) on the project
- (Recommended) Let jobs generate logs in Cloud Logging: Logs Writer (roles/logging.logWriter) on the project
For more information about granting roles, see Manage access to projects, folders, and organizations.

Your administrator might also be able to give the job's service account the required permissions through custom roles or other predefined roles.

Make sure that you have the permissions required for this tutorial.

Note: If you created the project that you plan to use for this tutorial, no action is required for this step. As the project creator, you are automatically granted the Owner (roles/owner) IAM role, which already has all the permissions required for this tutorial.

To get the permissions that you need to complete this tutorial, ask your administrator to grant you the following IAM roles:
- Batch Job Editor (roles/batch.jobsEditor) on the project
- Service Account User (roles/iam.serviceAccountUser) on the job's service account
- Storage Object Admin (roles/storage.objectAdmin) on the project

Install Nextflow:

curl -s -L https://github.com/nextflow-io/nextflow/releases/download/v23.04.1/nextflow | bash

The output should be similar to the following:

N E X T F L O W
version 23.04.1 build 5866
created 15-04-2023 06:51 UTC
cite doi:10.1038/nbt.3820
http://nextflow.io

Nextflow installation completed. Please note:
- the executable file `nextflow` has been created in the folder: ...
- you may complete the installation by moving it to a directory in your $PATH

Create a Cloud Storage bucket

To create a Cloud Storage bucket to store temporary work and output files from the Nextflow pipeline, use the Google Cloud console or the command-line.

Console

To create a Cloud Storage bucket using the Google Cloud console, follow these steps:

In the Google Cloud console, go to the Buckets page.

Go to Buckets
Click Create.
On the Create a bucket page, enter a globally unique name for your bucket.
Click Create.
In the Public access will be prevented window, click Confirm.

gcloud

To create a Cloud Storage bucket using the Google Cloud CLI, use the gcloud storage buckets create command.

gcloud storage buckets create gs://BUCKET_NAME

Replace BUCKET_NAME with a globally unique name for your bucket.

If the request is successful, the output should be similar to the following:

Creating gs://BUCKET_NAME/...
   ```

Configure Nextflow

To configure the Nextflow pipeline to run on Batch, follow these steps in the command-line:

Clone the sample pipeline repository:

git clone https://github.com/nextflow-io/rnaseq-nf.git

Go to the rnaseq-nf folder:
```
cd rnaseq-nf
```

Open the nextflow.config file:

nano nextflow.config

The file should contain the following gcb section:

gcb {
  params.transcriptome = 'gs://rnaseq-nf/data/ggal/transcript.fa'
  params.reads = 'gs://rnaseq-nf/data/ggal/gut_{1,2}.fq'
  params.multiqc = 'gs://rnaseq-nf/multiqc'
  process.executor = 'google-batch'
  process.container = 'quay.io/nextflow/rnaseq-nf:v1.1'
  workDir = 'gs://BUCKET_NAME/WORK_DIRECTORY'
  google.region  = 'REGION'
}

In the gcb section, do the following:
1. Replace BUCKET_NAME with the name of the Cloud Storage bucket you created in the previous steps.
2. Replace WORK_DIRECTORY with the name for a new folder that the pipeline can use to store logs and outputs.
  
  For example, enter workDir.
3. Replace REGION with the region to use.
  
  For example, enter us-central1.
4. After the google.region field, add the following fields:
  1. Add the google.project field:
```
google.project = 'PROJECT_ID'
```
    Replace PROJECT_ID with the project ID of the current Google Cloud project.
  2. If you aren't using the Compute Engine default service account as the job's service account, add the google.batch.serviceAccountEmail field:
```
google.batch.serviceAccountEmail = 'SERVICE_ACCOUNT_EMAIL'
```
    Replace SERVICE_ACCOUNT_EMAIL with the email address of the job's service account that you prepared for this tutorial.
  Objective: To learn about all the fields that you can specify to configure the Batch job for a Nextflow pipeline, see the Nextflow documentation for the Batch configuration scope.
To save your edits, do the following:
1. Press Control+S.
2. Enter Y.
3. Press Enter.

Run the pipeline

Run the sample Nextflow pipeline using the command-line:

../nextflow run nextflow-io/rnaseq-nf -profile gcb

The pipeline runs a small dataset using the settings you provided in the previous steps. This operation might take up to 10 minutes to complete.

After the pipeline finishes running, the output should be similar to the following:

N E X T F L O W  ~  version 23.04.1
Launching `https://github.com/nextflow-io/rnaseq-nf` [crazy_curry] DSL2 - revision: 88b8ef803a [master]
 R N A S E Q - N F   P I P E L I N E
 ===================================
 transcriptome: gs://rnaseq-nf/data/ggal/transcript.fa
 reads        : gs://rnaseq-nf/data/ggal/gut_{1,2}.fq
 outdir       : results

Uploading local `bin` scripts folder to gs://example-bucket/workdir/tmp/53/2847f2b832456a88a8e4cd44eec00a/bin
executor >  google-batch (4)
[67/71b856] process > RNASEQ:INDEX (transcript)     [100%] 1 of 1 ✔
[0c/2c79c6] process > RNASEQ:FASTQC (FASTQC on gut) [100%] 1 of 1 ✔
[a9/571723] process > RNASEQ:QUANT (gut)            [100%] 1 of 1 ✔
[9a/1f0dd4] process > MULTIQC                       [100%] 1 of 1 ✔

Done! Open the following report in your browser --> results/multiqc_report.html

Completed at: 20-Apr-2023 15:44:55
Duration    : 10m 13s
CPU hours   : (a few seconds)
Succeeded   : 4

View outputs of the pipeline

After the pipeline finishes running, it stores output files, logs, errors, or temporary files in the results/qc_report.html file within the WORK_DIRECTORY folder of your Cloud Storage bucket.

To check the pipeline's output files in the WORK_DIRECTORY folder of your Cloud Storage bucket, you can use the Google Cloud console or the command-line.

Console

To check the pipeline's output files using the Google Cloud console, follow these steps:

In the Google Cloud console, go to the Buckets page.

Go to Buckets
In the Name column, click the name of the bucket you created in the previous steps.
On the Bucket details page, open the WORK_DIRECTORY folder.

There is a folder for each separate tasks that the workflow run. Each folder contains the commands that were run, output files, and temporary files created by the pipeline.

gcloud

To check the pipeline's output files using the gcloud CLI, use the gcloud storage ls command.

gcloud storage ls gs://BUCKET_NAME/WORK_DIRECTORY

Replace the following:

BUCKET_NAME: the name of the bucket you created in the previous steps.
WORK_DIRECTORY: the directory you specified in the nextflow.config file.

The output lists a folder for each separate tasks that the pipeline run. Each folder contains the commands that were run, output files, and temporary files created by the pipeline.

Clean up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.

Delete the project

The easiest way to eliminate billing is to delete the current project.

To delete the current project, use the Google Cloud console or the gcloud CLI.

Console

In the Google Cloud console, go to the Manage resources page.
Go to Manage resources
In the project list, select the project that you want to delete, and then click Delete.
In the dialog, type the project ID, and then click Shut down to delete the project.

gcloud

Delete a Google Cloud project:

gcloud projects delete PROJECT_ID

Delete individual resources

If you want to keep using the current project, then delete the individual resources used in this tutorial.

Delete the bucket

If you no longer need the bucket you used in this tutorial, then delete the bucket.

Delete the output files in the bucket

After the pipeline finishes running, it creates and stores output files in the WORK_DIRECTORY folder of your Cloud Storage bucket.

To reduce Cloud Storage charges to the current current Google Cloud account, you can delete the folder containing the pipeline's output files by using the Google Cloud console or the command-line.

Console

To delete the WORK_DIRECTORY folder, and all the output files, from your Cloud Storage bucket using the Google Cloud console, follow these steps:

In the Google Cloud console, go to the Buckets page.

Go to Buckets
In the Name column, click the name of the bucket you created in the previous steps.
On the Bucket details page, select the row containing the WORK_DIRECTORY folder, and then do the following:
1. Click Delete.
2. To confirm, enter DELETE, and then click Delete.

gcloud

To delete the WORK_DIRECTORY folder, and all the output files, from your Cloud Storage bucket using the gcloud CLI, use the gcloud storage rm command with the --recursive flag.

gcloud storage rm gs://BUCKET_NAME/WORK_DIRECTORY \
      --recursive

Replace the following:

BUCKET_NAME: the name of the bucket you specified in the previous steps.
WORK_DIRECTORY: the directory to store the pipeline output files you specified in the previous steps.

What's next

To learn more about deploying Nextflow workflows, see the Nextflow GitHub repository.
To learn more about Nextflow processes, scripting, and configuration options, see the Nextflow documentation.