This page explains how to run a Nextflow pipeline on Google Cloud.
The pipeline used in this tutorial is a proof-of-concept RNA-Seq pipeline that demonstrates how to use Nextflow on Google Cloud.
Objectives
After completing this tutorial, you'll know how to do the following:
- Install Nextflow in Cloud Shell.
- Configure a Nextflow pipeline.
- Run a pipeline using Nextflow on Google Cloud.
Costs
In this document, you use the following billable components of Google Cloud:
- Compute Engine
- Cloud Storage
To generate a cost estimate based on your projected usage, use the pricing calculator.
Before you begin
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
- In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
- Make sure that billing is enabled for your Google Cloud project.
- Enable the Cloud Life Sciences, Compute Engine, and Cloud Storage APIs.
Create a Cloud Storage bucket
Create a uniquely named bucket following the bucket naming guidelines. The bucket stores temporary work and output files throughout this tutorial. For DNS compatibility, this tutorial does not work with bucket names containing an underscore (_).
Console
In the Google Cloud console, go to the Cloud Storage Browser page.
Click Create bucket.
On the Create a bucket page, enter your bucket information.
Click Create.
gcloud
Open Cloud Shell.
Use the gcloud storage buckets create command:
gcloud storage buckets create gs://BUCKET_NAME
Replace BUCKET_NAME with the name you want to give your bucket, subject to the naming requirements. For example, my-bucket.
If the request is successful, the command returns the following message:
Creating gs://BUCKET_NAME/...
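If you want the bucket created in a specific location, you can optionally pass a location when creating it. The following variant uses the --location flag of gcloud storage buckets create; the region shown is only an example and is not required by this tutorial:
gcloud storage buckets create gs://BUCKET_NAME --location=europe-west1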
Create a service account and add roles
Complete the following steps to create a service account and add the following Identity and Access Management roles:
- Cloud Life Sciences Workflows Runner
- Service Account User
- Service Usage Consumer
- Storage Object Admin
Console
Create a service account using Google Cloud console:
In the Google Cloud console, go to the Service Accounts page.
Click Create service account.
In the Service account name field, enter nextflow-service-account, and then click Create.
In the Grant this service account access to project section, add the following roles from the Select a role drop-down list:
- Cloud Life Sciences Workflows Runner
- Service Account User
- Service Usage Consumer
- Storage Object Admin
Click Continue, and then click Done.
On the Service Accounts page, find the service account that you created. In the service account's row, click the More button, and then click Manage keys.
On the Keys page, click Add key, and then click Create new key.
Select JSON for the Key type and click Create.
A JSON file that contains your key downloads to your computer.
gcloud
Complete the following steps using Cloud Shell:
Open Cloud Shell.
Set the variables to be used in creating the service account, replacing PROJECT_ID with your project ID.
export PROJECT=PROJECT_ID
export SERVICE_ACCOUNT_NAME=nextflow-service-account
export SERVICE_ACCOUNT_ADDRESS=${SERVICE_ACCOUNT_NAME}@${PROJECT}.iam.gserviceaccount.com
Create the service account.
gcloud iam service-accounts create ${SERVICE_ACCOUNT_NAME}
The service account needs the following IAM roles:
- roles/lifesciences.workflowsRunner
- roles/iam.serviceAccountUser
- roles/serviceusage.serviceUsageConsumer
- roles/storage.objectAdmin
Grant these roles by running the following commands in Cloud Shell:
gcloud projects add-iam-policy-binding ${PROJECT} \
    --member serviceAccount:${SERVICE_ACCOUNT_ADDRESS} \
    --role roles/lifesciences.workflowsRunner

gcloud projects add-iam-policy-binding ${PROJECT} \
    --member serviceAccount:${SERVICE_ACCOUNT_ADDRESS} \
    --role roles/iam.serviceAccountUser

gcloud projects add-iam-policy-binding ${PROJECT} \
    --member serviceAccount:${SERVICE_ACCOUNT_ADDRESS} \
    --role roles/serviceusage.serviceUsageConsumer

gcloud projects add-iam-policy-binding ${PROJECT} \
    --member serviceAccount:${SERVICE_ACCOUNT_ADDRESS} \
    --role roles/storage.objectAdmin
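To confirm that the role bindings took effect, you can optionally list the roles granted to the service account. This sketch uses the documented --flatten, --filter, and --format options of gcloud projects get-iam-policy:
gcloud projects get-iam-policy ${PROJECT} \
    --flatten="bindings[].members" \
    --filter="bindings.members:serviceAccount:${SERVICE_ACCOUNT_ADDRESS}" \
    --format="table(bindings.role)"
The output should list the four roles granted above.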
Provide credentials to your application
You can provide authentication credentials to your application code or commands by setting the GOOGLE_APPLICATION_CREDENTIALS environment variable to the path of the JSON file that contains your service account key.
The following steps show how to set the GOOGLE_APPLICATION_CREDENTIALS environment variable:
Console
Open Cloud Shell.
From the Cloud Shell More menu, select Upload file, and then select the JSON key file that you created. The file is uploaded to the home directory of your Cloud Shell instance.
Confirm that the uploaded file is in your present directory and confirm the filename by running the following command:
ls
Set the credentials, replacing KEY_FILENAME.json with the name of your key file.
export GOOGLE_APPLICATION_CREDENTIALS=${PWD}/KEY_FILENAME.json
gcloud
Complete the following steps using Cloud Shell:
Open Cloud Shell.
Create a service account key and set the GOOGLE_APPLICATION_CREDENTIALS environment variable to its path:
export SERVICE_ACCOUNT_KEY=${SERVICE_ACCOUNT_NAME}-private-key.json
gcloud iam service-accounts keys create \
    --iam-account=${SERVICE_ACCOUNT_ADDRESS} \
    --key-file-type=json ${SERVICE_ACCOUNT_KEY}
export SERVICE_ACCOUNT_KEY_FILE=${PWD}/${SERVICE_ACCOUNT_KEY}
export GOOGLE_APPLICATION_CREDENTIALS=${PWD}/${SERVICE_ACCOUNT_KEY}
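As an optional sanity check that the key was created, you can list the keys attached to the service account:
gcloud iam service-accounts keys list \
    --iam-account=${SERVICE_ACCOUNT_ADDRESS}
The newly created key ID should appear in the output.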
Install and configure Nextflow in Cloud Shell
To avoid having to install any software on your machine, continue running all the terminal commands in this tutorial from Cloud Shell.
If not already open, open Cloud Shell.
Install Nextflow by running the following commands:
export NXF_VER=21.10.0
export NXF_MODE=google
curl https://get.nextflow.io | bash
If the installation completes successfully, the following message displays:
      N E X T F L O W
      version 21.10.0 build 5430
      created 01-11-2020 15:14 UTC (10:14 EDT)
      cite doi:10.1038/nbt.3820
      http://nextflow.io

Nextflow installation completed. Please note:
- the executable file `nextflow` has been created in the folder: DIRECTORY
- you may complete the installation by moving it to a directory in your $PATH
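To verify the installation before continuing, you can print the Nextflow version:
./nextflow -version
The command should report version 21.10.0, matching the NXF_VER value exported above.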
Run the following command to clone the sample pipeline repository. The repository includes the pipeline to run and the sample data that the pipeline uses.
git clone https://github.com/nextflow-io/rnaseq-nf.git
Complete the following steps to configure Nextflow:
Change to the rnaseq-nf folder:
cd rnaseq-nf
git checkout v2.0
Using a text editor of your choice, edit the file named nextflow.config and make the following updates to the section labeled gls:
- Add the line google.project if it is not present.
- Replace PROJECT_ID with your project ID.
- If desired, change the value of google.location. It must be one of the currently available Cloud Life Sciences API locations.
- If desired, change the value of google.region, which specifies the region in which the Compute Engine VMs launch. See the available Compute Engine regions and zones.
- Replace BUCKET with the bucket name created previously.
- Replace WORK_DIR with the name of a folder to use for logging and output. Use a new directory name that does not yet exist in your bucket.
gls {
    params.transcriptome = 'gs://rnaseq-nf/data/ggal/transcript.fa'
    params.reads = 'gs://rnaseq-nf/data/ggal/gut_{1,2}.fq'
    params.multiqc = 'gs://rnaseq-nf/multiqc'
    process.executor = 'google-lifesciences'
    process.container = 'nextflow/rnaseq-nf:latest'
    workDir = 'gs://BUCKET/WORK_DIR'
    google.location = 'europe-west2'
    google.region = 'europe-west1'
    google.project = 'PROJECT_ID'
}
Change back to the previous folder:
cd ..
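Before launching the pipeline, you can optionally print the configuration that Nextflow resolves for the gls profile; this is a quick way to catch typos in nextflow.config. The nextflow config command is part of the standard Nextflow CLI:
./nextflow config rnaseq-nf -profile gls
Check that google.project, workDir, and the other values match what you entered.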
Run the pipeline with Nextflow
Run the pipeline with Nextflow. After you start the pipeline, it continues to run in the background until completion. It might take up to 10 minutes for the pipeline to finish.
./nextflow run rnaseq-nf/main.nf -profile gls
After the pipeline finishes, the following message displays:
N E X T F L O W  ~  version 21.10.0
Launching `rnaseq-nf/main.nf` [suspicious_mestorf] - revision: ef908c0bfd
R N A S E Q - N F   P I P E L I N E
===================================
transcriptome: gs://rnaseq-nf/data/ggal/transcript.fa
reads        : gs://rnaseq-nf/data/ggal/gut_{1,2}.fq
outdir       : results
executor >  google-lifesciences (4)
[db/2af640] process > RNASEQ:INDEX (transcript)     [100%] 1 of 1 ✔
[a6/927725] process > RNASEQ:FASTQC (FASTQC on gut) [100%] 1 of 1 ✔
[59/438177] process > RNASEQ:QUANT (gut)            [100%] 1 of 1 ✔
[9a/9743b9] process > MULTIQC                       [100%] 1 of 1 ✔

Done! Open the following report in your browser --> results/multiqc_report.html

Completed at: DATE TIME
Duration    : 10m
CPU hours   : 0.2
Succeeded   : 4
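If the pipeline fails partway through (for example, because of a missing IAM role), you can fix the issue and relaunch without redoing completed tasks by using Nextflow's -resume option:
./nextflow run rnaseq-nf/main.nf -profile gls -resume
Nextflow reuses the cached results stored in WORK_DIR for any task that already succeeded.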
View the output of the Nextflow pipeline
After the pipeline finishes, you can check the output and any logs, errors, commands run, and temporary files.
The pipeline saves the final output file, results/multiqc_report.html, to the Cloud Storage bucket that you specified in the nextflow.config file.
To check individual output files from each task and intermediate files, complete the following steps:
Console
In the Google Cloud console, go to the Cloud Storage Browser page.
Go to the BUCKET and browse to the WORK_DIR specified in the nextflow.config file.
There is a folder for each of the separate tasks that were run in the pipeline.
Each folder contains the commands that were run, the output files, and the temporary files used during the workflow.
gcloud
To see the output files in Cloud Shell, first open Cloud Shell:
Run the following command to list the outputs in your Cloud Storage bucket. Update BUCKET and WORK_DIR to the values specified in the nextflow.config file.
gcloud storage ls gs://BUCKET/WORK_DIR
The output shows a folder for each of the tasks that were run. Continue to list the contents of subsequent subdirectories to see all the files created by the pipeline. Update TASK_FOLDER to one of the task folders listed by the previous command.
gcloud storage ls gs://BUCKET/WORK_DIR/FOLDER/TASK_FOLDER
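Instead of listing one subdirectory at a time, you can list everything under WORK_DIR in a single command with the --recursive flag of gcloud storage ls:
gcloud storage ls --recursive gs://BUCKET/WORK_DIR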
You can either view the intermediate files created by the pipeline and choose which ones you want to keep, or remove them to reduce costs associated with Cloud Storage. To remove the files, see Deleting intermediate files in your Cloud Storage bucket.
Troubleshooting
If you encounter problems when running the pipeline, see Cloud Life Sciences API troubleshooting.
If your pipeline fails, you can check the logs for each task by looking at the log files in each of the folders in Cloud Storage, such as .command.err, .command.log, and .command.out.
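For example, to print a task's error log directly in Cloud Shell without downloading it, you can use gcloud storage cat. TASK_FOLDER here is a placeholder for one of the task directories listed earlier:
gcloud storage cat gs://BUCKET/WORK_DIR/FOLDER/TASK_FOLDER/.command.err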
Clean up
To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.
After you finish the tutorial, you can clean up the resources that you created so that they stop using quota and incurring charges. The following sections describe how to delete or turn off these resources.
Delete intermediate files in your Cloud Storage bucket
When you run the pipeline, it stores intermediate files in gs://BUCKET/WORK_DIR. You can remove the files after the workflow completes to reduce Cloud Storage charges.
To view the amount of space used in the directory, run the following command:
gcloud storage du gs://BUCKET/WORK_DIR --readable-sizes --summarize
To remove files from WORK_DIR, complete the following steps:
Console
In the Google Cloud console, go to the Cloud Storage Browser page.
Go to the BUCKET and browse to the WORK_DIR specified in the nextflow.config file.
Browse through the subfolders and delete any unwanted files or directories. To delete all files, delete the entire WORK_DIR.
gcloud
Open Cloud Shell. To remove the intermediate files in the WORK_DIR directory, run the following command:
gcloud storage rm gs://BUCKET/WORK_DIR/**
Delete the project
The easiest way to eliminate billing is to delete the project that you created for the tutorial.
To delete the project:
- In the Google Cloud console, go to the Manage resources page.
- In the project list, select the project that you want to delete, and then click Delete.
- In the dialog, type the project ID, and then click Shut down to delete the project.
What's next
The following pages provide more background information, documentation, and support for using Nextflow: