This tutorial explains how to run a Nextflow pipeline on Batch. Specifically, this tutorial runs the sample rnaseq-nf life sciences pipeline from Nextflow, which quantifies genomic features from short read data using RNA-Seq.

This tutorial is intended for Batch users who want to use Nextflow with Batch. Nextflow is open-source software for orchestrating bioinformatics workflows.
Objectives
By completing this tutorial, you'll learn how to do the following:
- Install Nextflow in Cloud Shell.
- Create a Cloud Storage bucket.
- Configure a Nextflow pipeline.
- Run a sample pipeline using Nextflow on Batch.
- View outputs of the pipeline.
- Clean up to avoid incurring additional charges by doing one of the following:
  - Delete a project.
  - Delete individual resources.
Costs
In this document, you use the following billable components of Google Cloud:
- Batch
- Cloud Storage
To generate a cost estimate based on your projected usage,
use the pricing calculator.
The resources created in this tutorial typically cost less than a dollar, assuming you complete all the steps—including the cleanup—in a timely manner.
Before you begin
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
- Install the Google Cloud CLI.
- To initialize the gcloud CLI, run the following command:

  gcloud init

- Create or select a Google Cloud project:

  - Create a Google Cloud project:

    gcloud projects create PROJECT_ID

    Replace PROJECT_ID with a name for the Google Cloud project you are creating.

  - Select the Google Cloud project that you created:

    gcloud config set project PROJECT_ID

    Replace PROJECT_ID with your Google Cloud project name.

- Make sure that billing is enabled for your Google Cloud project.
- Enable the Batch, Cloud Storage, Compute Engine, and Logging APIs:

  gcloud services enable batch.googleapis.com compute.googleapis.com logging.googleapis.com storage.googleapis.com
- Make sure that your project has a Virtual Private Cloud (VPC) network with a valid networking configuration for this tutorial. This tutorial assumes that you are using the default network. By default, Google Cloud resources use the default network, which provides the network access required for this tutorial.
- Make sure that your project has at least one service account with the permissions required for running the Batch job in this tutorial.

  By default, jobs use the Compute Engine default service account, which is automatically granted the Editor (roles/editor) IAM role and already has all the permissions required for this tutorial.

  To ensure that the job's service account has the necessary permissions to allow the Batch service agent to create and access resources for Batch jobs, ask your administrator to grant the job's service account the following IAM roles:

  - Batch Agent Reporter (roles/batch.agentReporter) on the project
  - Storage Admin (roles/storage.admin) on the project
  - (Recommended) Let jobs generate logs in Cloud Logging: Logs Writer (roles/logging.logWriter) on the project

  For more information about granting roles, see Manage access to projects, folders, and organizations. Your administrator might also be able to give the job's service account the required permissions through custom roles or other predefined roles. A sketch of granting these roles with the gcloud CLI follows this list.
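  For reference, here is a minimal sketch of granting one of these roles with the gcloud CLI. It uses the standard gcloud projects add-iam-policy-binding command and is not part of the original steps; repeat the command once per role:

  gcloud projects add-iam-policy-binding PROJECT_ID \
      --member="serviceAccount:SERVICE_ACCOUNT_EMAIL" \
      --role="roles/batch.agentReporter"

  Replace PROJECT_ID with your project ID and SERVICE_ACCOUNT_EMAIL with the email address of the job's service account.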
- Make sure that you have the permissions required for this tutorial. To get the permissions that you need to complete this tutorial, ask your administrator to grant you the following IAM roles:

  - Batch Job Editor (roles/batch.jobsEditor) on the project
  - Service Account User (roles/iam.serviceAccountUser) on the job's service account
  - Storage Object Admin (roles/storage.objectAdmin) on the project
- Install Nextflow:
curl -s -L https://github.com/nextflow-io/nextflow/releases/download/v23.04.1/nextflow | bash
The output should be similar to the following:
  N E X T F L O W
  version 23.04.1 build 5866
  created 15-04-2023 06:51 UTC
  cite doi:10.1038/nbt.3820
  http://nextflow.io

  Nextflow installation completed. Please note:
  - the executable file `nextflow` has been created in the folder: ...
  - you may complete the installation by moving it to a directory in your $PATH
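  Optionally, to confirm the download before continuing, you can run the executable from the directory where it was created (a quick check, not part of the original steps):

  ./nextflow -version

  This prints the version banner again, confirming that the nextflow executable runs from your working directory.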
Create a Cloud Storage bucket
To create a Cloud Storage bucket to store temporary work and output files from the Nextflow pipeline, use the Google Cloud console or the command line.
Console
To create a Cloud Storage bucket using the Google Cloud console, follow these steps:
In the Google Cloud console, go to the Buckets page.
Click Create.

On the Create a bucket page, enter a globally unique name for your bucket.
Click Create.
In the Public access will be prevented window, click Confirm.
gcloud
To create a Cloud Storage bucket using the Google Cloud CLI, use the gcloud storage buckets create command:

gcloud storage buckets create gs://BUCKET_NAME

Replace BUCKET_NAME with a globally unique name for your bucket.
If the request is successful, the output should be similar to the following:
Creating gs://BUCKET_NAME/...
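For example, the following command creates a bucket with a hypothetical name (bucket names must be globally unique, so substitute your own):

gcloud storage buckets create gs://example-nextflow-tutorial-bucket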
Configure Nextflow
To configure the Nextflow pipeline to run on Batch, follow these steps on the command line:
Clone the sample pipeline repository:
git clone https://github.com/nextflow-io/rnaseq-nf.git
Go to the rnaseq-nf folder:

cd rnaseq-nf
Open the nextflow.config file:

nano nextflow.config

The file should contain the following gcb section:

gcb {
    params.transcriptome = 'gs://rnaseq-nf/data/ggal/transcript.fa'
    params.reads = 'gs://rnaseq-nf/data/ggal/gut_{1,2}.fq'
    params.multiqc = 'gs://rnaseq-nf/multiqc'
    process.executor = 'google-batch'
    process.container = 'quay.io/nextflow/rnaseq-nf:v1.1'
    workDir = 'gs://BUCKET_NAME/WORK_DIRECTORY'
    google.region = 'REGION'
}
In the gcb section, do the following:

- Replace BUCKET_NAME with the name of the Cloud Storage bucket you created in the previous steps.
- Replace WORK_DIRECTORY with the name for a new folder that the pipeline can use to store logs and outputs. For example, enter workDir.
- Replace REGION with the region to use. For example, enter us-central1.
- After the google.region field, add the following fields:

  - Add the google.project field:

    google.project = 'PROJECT_ID'

    Replace PROJECT_ID with the project ID of the current Google Cloud project.

  - If you aren't using the Compute Engine default service account as the job's service account, add the google.batch.serviceAccountEmail field:

    google.batch.serviceAccountEmail = 'SERVICE_ACCOUNT_EMAIL'

    Replace SERVICE_ACCOUNT_EMAIL with the email address of the job's service account that you prepared for this tutorial.

A complete example of the edited section follows.
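For example, with a hypothetical bucket named example-bucket and a hypothetical project ID, the edited gcb section might look like the following (all values here are illustrative placeholders, not required names):

gcb {
    params.transcriptome = 'gs://rnaseq-nf/data/ggal/transcript.fa'
    params.reads = 'gs://rnaseq-nf/data/ggal/gut_{1,2}.fq'
    params.multiqc = 'gs://rnaseq-nf/multiqc'
    process.executor = 'google-batch'
    process.container = 'quay.io/nextflow/rnaseq-nf:v1.1'
    workDir = 'gs://example-bucket/workDir'    // example bucket and folder
    google.region = 'us-central1'              // example region
    google.project = 'example-project-id'      // example project ID
}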
To save your edits, do the following:

- Press Control+S.
- Enter Y.
- Press Enter.
Run the pipeline
Run the sample Nextflow pipeline from the command line:
../nextflow run nextflow-io/rnaseq-nf -profile gcb
The pipeline processes a small dataset using the settings you provided in the previous steps. This operation might take up to 10 minutes to complete.
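If the run is interrupted, you can rerun the pipeline without repeating tasks that already completed by adding Nextflow's standard -resume option (this tip is not part of the original steps):

../nextflow run nextflow-io/rnaseq-nf -profile gcb -resume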
After the pipeline finishes running, the output should be similar to the following:
N E X T F L O W ~ version 23.04.1
Launching `https://github.com/nextflow-io/rnaseq-nf` [crazy_curry] DSL2 - revision: 88b8ef803a [master]
R N A S E Q - N F P I P E L I N E
===================================
transcriptome: gs://rnaseq-nf/data/ggal/transcript.fa
reads : gs://rnaseq-nf/data/ggal/gut_{1,2}.fq
outdir : results
Uploading local `bin` scripts folder to gs://example-bucket/workdir/tmp/53/2847f2b832456a88a8e4cd44eec00a/bin
executor > google-batch (4)
[67/71b856] process > RNASEQ:INDEX (transcript) [100%] 1 of 1 ✔
[0c/2c79c6] process > RNASEQ:FASTQC (FASTQC on gut) [100%] 1 of 1 ✔
[a9/571723] process > RNASEQ:QUANT (gut) [100%] 1 of 1 ✔
[9a/1f0dd4] process > MULTIQC [100%] 1 of 1 ✔
Done! Open the following report in your browser --> results/multiqc_report.html
Completed at: 20-Apr-2023 15:44:55
Duration : 10m 13s
CPU hours : (a few seconds)
Succeeded : 4
View outputs of the pipeline
After the pipeline finishes running, it stores output files, logs, errors, and temporary files in the WORK_DIRECTORY folder of your Cloud Storage bucket. The final quality-control report is written to the results/multiqc_report.html file, as shown in the run output.
To check the pipeline's output files in the WORK_DIRECTORY folder of your Cloud Storage bucket, you can use the Google Cloud console or the command line.
Console
To check the pipeline's output files using the Google Cloud console, follow these steps:
In the Google Cloud console, go to the Buckets page.
In the Name column, click the name of the bucket you created in the previous steps.
On the Bucket details page, open the WORK_DIRECTORY folder.
There is a folder for each of the separate tasks that the pipeline ran. Each folder contains the commands that were run, output files, and temporary files created by the pipeline.
gcloud
To check the pipeline's output files using the gcloud CLI, use the gcloud storage ls command:

gcloud storage ls gs://BUCKET_NAME/WORK_DIRECTORY

Replace the following:

- BUCKET_NAME: the name of the bucket you created in the previous steps.
- WORK_DIRECTORY: the directory you specified in the nextflow.config file.
The output lists a folder for each of the separate tasks that the pipeline ran. Each folder contains the commands that were run, output files, and temporary files created by the pipeline.
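Optionally, to copy the pipeline's files from the bucket to your local machine for inspection, you can use the gcloud storage cp command with the --recursive flag (the local destination directory here is an arbitrary example):

gcloud storage cp --recursive gs://BUCKET_NAME/WORK_DIRECTORY ./pipeline-outputs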
Clean up
To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.
Delete the project
The easiest way to eliminate billing is to delete the current project.
To delete the current project, use the Google Cloud console or the gcloud CLI.
Console
- In the Google Cloud console, go to the Manage resources page.
- In the project list, select the project that you want to delete, and then click Delete.
- In the dialog, type the project ID, and then click Shut down to delete the project.
gcloud
Delete a Google Cloud project:
gcloud projects delete PROJECT_ID
Delete individual resources
If you want to keep using the current project, then delete the individual resources used in this tutorial.
Delete the bucket
If you no longer need the bucket you used in this tutorial, then delete the bucket.
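For example, a minimal sketch using the gcloud CLI: the gcloud storage rm command with the --recursive flag deletes the bucket's objects and then the bucket itself:

gcloud storage rm --recursive gs://BUCKET_NAME

Replace BUCKET_NAME with the name of the bucket you created for this tutorial.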
Delete the output files in the bucket
After the pipeline finishes running, it creates and stores output files in the WORK_DIRECTORY folder of your Cloud Storage bucket.
To reduce Cloud Storage charges to the current Google Cloud account, you can delete the folder containing the pipeline's output files by using the Google Cloud console or the command line.
Console
To delete the WORK_DIRECTORY folder, and all the output files, from your Cloud Storage bucket using the Google Cloud console, follow these steps:
In the Google Cloud console, go to the Buckets page.
In the Name column, click the name of the bucket you created in the previous steps.
On the Bucket details page, select the row containing the WORK_DIRECTORY folder, and then do the following:

- Click Delete.
- To confirm, enter DELETE, and then click Delete.
gcloud
To delete the WORK_DIRECTORY folder, and all the output files, from your Cloud Storage bucket using the gcloud CLI, use the gcloud storage rm command with the --recursive flag:
gcloud storage rm gs://BUCKET_NAME/WORK_DIRECTORY \
--recursive
Replace the following:
- BUCKET_NAME: the name of the bucket you created in the previous steps.
- WORK_DIRECTORY: the directory that stores the pipeline's output files, which you specified in the previous steps.
What's next
To learn more about deploying Nextflow workflows, see the Nextflow GitHub repository.
To learn more about Nextflow processes, scripting, and configuration options, see the Nextflow documentation.