This page explains how to run a pipeline on Google Cloud using Nextflow.
The pipeline used in this tutorial is a proof of concept of an RNA-Seq pipeline intended to show Nextflow usage on Google Cloud.
Objectives
After completing this tutorial, you'll know how to:
- Install Nextflow in Cloud Shell.
- Configure a Nextflow pipeline.
- Run a pipeline using Nextflow on Google Cloud.
Costs
This tutorial uses billable components of Google Cloud, including:
- Compute Engine
- Cloud Storage
Use the Pricing Calculator to generate a cost estimate based on your projected usage. New Google Cloud users might be eligible for a free trial.
Before you begin
- Sign in to your Google Account. If you don't already have one, sign up for a new account.
- In the Google Cloud Console, on the project selector page, select or create a Google Cloud project.
- Make sure that billing is enabled for your Cloud project. Learn how to confirm that billing is enabled for your project.
- Enable the Cloud Life Sciences, Compute Engine, and Cloud Storage APIs (or enable them from Cloud Shell, as shown below).
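If you prefer the command line, you can enable the same APIs from Cloud Shell. A minimal sketch, assuming your active gcloud configuration already points at the tutorial project:
# Enable the APIs used in this tutorial
gcloud services enable \
    lifesciences.googleapis.com \
    compute.googleapis.com \
    storage.googleapis.com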
Create a Cloud Storage bucket
Following the bucket naming guidelines, create a uniquely named bucket to store temporary work and output files throughout this tutorial. For DNS compatibility, this tutorial does not work with bucket names that contain an underscore (_).
Console
In the Cloud Console, open the Cloud Storage browser.
Click Create bucket.
In the Bucket name text box, enter a unique name for your bucket, and then click Create.
gcloud
Open Cloud Shell.
Run the following command to create a bucket, replacing BUCKET with a unique name for your bucket.
gsutil mb gs://BUCKET
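To confirm the bucket exists before moving on, you can list it; the -b flag limits the output to the bucket entry itself:
# Verify the bucket was created
gsutil ls -b gs://BUCKET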
Create a service account and add roles
Complete the following steps to create a service account and add the relevant IAM roles:
Console
Create a service account using Cloud Console:
In the Cloud Console, go to the Service Accounts page.
Click Create service account.
In the Service account name field, enter nextflow-service-account, and then click Create.
In the Grant this service account access to project section, add the following roles from the Select a role drop-down list:
- Cloud Life Sciences Workflows Runner
- Service Account User
- Service Usage Consumer
- Storage Object Admin
Click Continue, and then click Done.
In the Service Accounts page, find the service account you created. In the same row, click More, and then click Create key.
In the Create private key for "nextflow-service-account" window that appears, complete the following steps:
- Under Key type, select JSON.
- Click Create.
A JSON file that contains your key downloads to your computer.
gcloud
Complete the following steps using Cloud Shell:
Open Cloud Shell.
Set the variables to be used in creating the service account, replacing PROJECT_ID with your project ID.
export PROJECT=PROJECT_ID
export SERVICE_ACCOUNT_NAME=nextflow-service-account
export SERVICE_ACCOUNT_ADDRESS=${SERVICE_ACCOUNT_NAME}@${PROJECT}.iam.gserviceaccount.com
Create the service account.
gcloud iam service-accounts create ${SERVICE_ACCOUNT_NAME}
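To verify that the account was created, you can filter the project's service account list by the address set above (a quick check, not required by the tutorial):
# Confirm the service account exists
gcloud iam service-accounts list \
    --filter="email:${SERVICE_ACCOUNT_ADDRESS}"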
The service account needs the following Identity and Access Management (IAM) roles:
- roles/lifesciences.workflowsRunner
- roles/iam.serviceAccountUser
- roles/serviceusage.serviceUsageConsumer
- roles/storage.objectAdmin
Grant these roles by running the following commands in Cloud Shell:
gcloud projects add-iam-policy-binding ${PROJECT} \
    --member serviceAccount:${SERVICE_ACCOUNT_ADDRESS} \
    --role roles/lifesciences.workflowsRunner
gcloud projects add-iam-policy-binding ${PROJECT} \
    --member serviceAccount:${SERVICE_ACCOUNT_ADDRESS} \
    --role roles/iam.serviceAccountUser
gcloud projects add-iam-policy-binding ${PROJECT} \
    --member serviceAccount:${SERVICE_ACCOUNT_ADDRESS} \
    --role roles/serviceusage.serviceUsageConsumer
gcloud projects add-iam-policy-binding ${PROJECT} \
    --member serviceAccount:${SERVICE_ACCOUNT_ADDRESS} \
    --role roles/storage.objectAdmin
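To confirm the bindings took effect, you can read back the project's IAM policy filtered to the new service account. A sketch using standard gcloud flags:
# List the roles granted to the service account
gcloud projects get-iam-policy ${PROJECT} \
    --flatten="bindings[].members" \
    --filter="bindings.members:serviceAccount:${SERVICE_ACCOUNT_ADDRESS}" \
    --format="value(bindings.role)"
The output should include the four roles listed above.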
Provide credentials to your application
You can provide authentication credentials to your application code or commands by setting the environment variable GOOGLE_APPLICATION_CREDENTIALS to the file path of the JSON file that contains your service account key.
The following steps show how to set the GOOGLE_APPLICATION_CREDENTIALS environment variable:
Console
Open Cloud Shell.
From the Cloud Shell More menu, select Upload file, and then select the JSON key file you just created. This step uploads the file to the home directory of your Cloud Shell instance.
Confirm that the uploaded file is in your present directory, and confirm the file name, by running the following command:
ls
Set the credentials, replacing KEY-FILENAME.json with the name of your key file.
export GOOGLE_APPLICATION_CREDENTIALS=${PWD}/KEY-FILENAME.json
gcloud
Complete the following steps using Cloud Shell:
Open Cloud Shell.
Create a service account key and set the GOOGLE_APPLICATION_CREDENTIALS environment variable to its file path:
export SERVICE_ACCOUNT_KEY=${SERVICE_ACCOUNT_NAME}-private-key.json
gcloud iam service-accounts keys create \
    --iam-account=${SERVICE_ACCOUNT_ADDRESS} \
    --key-file-type=json ${SERVICE_ACCOUNT_KEY}
export SERVICE_ACCOUNT_KEY_FILE=${PWD}/${SERVICE_ACCOUNT_KEY}
export GOOGLE_APPLICATION_CREDENTIALS=${PWD}/${SERVICE_ACCOUNT_KEY}
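As a quick sanity check, you can confirm that the variable points at a readable JSON key file. A minimal sketch, assuming python3 (preinstalled in Cloud Shell):
# Print the path and validate that the key file parses as JSON
echo ${GOOGLE_APPLICATION_CREDENTIALS}
python3 -m json.tool "${GOOGLE_APPLICATION_CREDENTIALS}" > /dev/null && echo "Key file is valid JSON"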
Install and configure Nextflow in Cloud Shell
To avoid having to install any software on your machine, continue running all the terminal commands in this tutorial from Cloud Shell.
If not already open, open Cloud Shell.
Run the following commands to install Nextflow:
export NXF_VER=20.10.0
export NXF_MODE=google
curl https://get.nextflow.io | bash
If the installation completes successfully, the following message displays:
N E X T F L O W
version 20.10.0 build 5430
created 01-11-2020 15:14 UTC (10:14 EDT)
cite doi:10.1038/nbt.3820
http://nextflow.io

Nextflow installation completed. Please note:
‐ the executable file `nextflow` has been created in the folder: DIRECTORY
‐ you may complete the installation by moving it to a directory in your $PATH
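Before proceeding, you can confirm the executable runs; the -version flag prints the build that was installed:
# Verify the Nextflow installation
./nextflow -version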
Run the following command to clone the sample pipeline repository. The repository includes the pipeline to run and the sample data that the pipeline uses.
git clone https://github.com/nextflow-io/rnaseq-nf.git
Complete the following steps to configure Nextflow:
Change to the rnaseq-nf folder and check out the pinned revision:
cd rnaseq-nf
git checkout v2.0
Using a text editor of your choice, edit the file named nextflow.config and make the following updates to the section labeled gls:
- Add the line google.project if it is not present.
- Replace PROJECT_ID with your project ID.
- If desired, change the value of google.location. It must be one of the currently available Cloud Life Sciences API locations.
- If desired, change the value of google.region, which specifies the region in which the Compute Engine VMs will be launched. See available Compute Engine Regions and Zones.
- Replace BUCKET with the bucket name created above.
- Replace WORK_DIR with the name of a folder to use for logging and output. Use a new directory name that does not yet exist in your bucket.
- Note: the workDir value must contain at least one subdirectory. Do not use just the bucket name.
The edited gls section looks similar to the following:
gls {
    params.transcriptome = 'gs://rnaseq-nf/data/ggal/transcript.fa'
    params.reads = 'gs://rnaseq-nf/data/ggal/gut_{1,2}.fq'
    params.multiqc = 'gs://rnaseq-nf/multiqc'
    process.executor = 'google-lifesciences'
    process.container = 'nextflow/rnaseq-nf:latest'
    workDir = 'gs://BUCKET/WORK_DIR'
    google.location = 'europe-west2'
    google.region = 'europe-west1'
    google.project = 'PROJECT_ID'
}
Change back to the previous folder:
cd ..
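Before launching, you can print the fully resolved configuration to confirm your edits took effect. A sketch using Nextflow's built-in config subcommand:
# Show the configuration that the gls profile resolves to
./nextflow config rnaseq-nf -profile gls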
Run the pipeline with Nextflow
Run the pipeline with Nextflow. After you start the pipeline, it continues to run in the background until it finishes, which might take up to 10 minutes.
./nextflow run rnaseq-nf/main.nf -profile gls
After the pipeline finishes, the following message displays:
N E X T F L O W ~ version 20.10.0
Launching `rnaseq-nf/main.nf` [suspicious_mestorf] - revision: ef908c0bfd
R N A S E Q - N F P I P E L I N E
===================================
transcriptome: gs://rnaseq-nf/data/ggal/transcript.fa
reads        : gs://rnaseq-nf/data/ggal/gut_{1,2}.fq
outdir       : results
executor > google-lifesciences (4)
[db/2af640] process > RNASEQ:INDEX (transcript) [100%] 1 of 1 ✔
[a6/927725] process > RNASEQ:FASTQC (FASTQC on gut) [100%] 1 of 1 ✔
[59/438177] process > RNASEQ:QUANT (gut) [100%] 1 of 1 ✔
[9a/9743b9] process > MULTIQC [100%] 1 of 1 ✔
Done! Open the following report in your browser --> results/multiqc_report.html
Completed at: DATE TIME
Duration    : 10m
CPU hours   : 0.2
Succeeded   : 4
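If you'd rather not keep the terminal attached while the pipeline runs, Nextflow can launch in the background and write its progress to a log file. A sketch, assuming the standard -bg option and the .nextflow.log file:
# Launch detached and follow the log
./nextflow run rnaseq-nf/main.nf -profile gls -bg
tail -f .nextflow.log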
Viewing the output of the Nextflow pipeline
After the pipeline finishes, you can check the output as well as any logs, errors, commands run, and temporary files.
The pipeline saves the final output file, results/multiqc_report.html, to the Cloud Storage bucket you specified in the nextflow.config file.
To check individual output files from each task and intermediate files, complete the following steps:
Console
In the Cloud Storage console, open the Storage Browser page.
Go to the BUCKET and browse to the WORK_DIR specified in the nextflow.config file.
There is a folder for each of the separate tasks that were run in the pipeline. Each folder contains the commands that were run, the output files, and the temporary files used during the workflow.
gcloud
To see the output files in Cloud Shell, first open Cloud Shell:
Run the following command to list the outputs in your Cloud Storage bucket, replacing BUCKET and WORK_DIR with the values specified in the nextflow.config file.
gsutil ls gs://BUCKET/WORK_DIR
The output shows a folder for each of the tasks that were run. Continue to list the contents of subsequent subdirectories to see all the files created by the pipeline. Replace FOLDER and TASK_FOLDER with one of the task paths listed by the previous command; Nextflow nests each task under a two-part hash path, such as db/2af640.
gsutil ls gs://BUCKET/WORK_DIR/FOLDER/TASK_FOLDER
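To see every file the pipeline created in one pass, a recursive listing also works, using the same BUCKET and WORK_DIR placeholders:
# List the entire work directory tree
gsutil ls -r gs://BUCKET/WORK_DIR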
You can either view the intermediate files created by the pipeline and choose which ones you want to keep, or remove them to reduce costs associated with Cloud Storage. To remove the files, see Deleting intermediate files in your Cloud Storage bucket.
Troubleshooting
If you encounter problems when running the pipeline, see Cloud Life Sciences API troubleshooting.
If your pipeline fails, you can check the logs for each task by looking at the log files in each of the task folders in Cloud Storage, such as .command.err, .command.log, .command.out, and so forth.
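For example, to pull a failed task's error log into Cloud Shell for inspection, a sketch using the same placeholders as the listing commands above:
# Copy and read one task's error log
gsutil cp "gs://BUCKET/WORK_DIR/FOLDER/TASK_FOLDER/.command.err" .
cat .command.err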
Cleaning up
To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.
After you've finished this tutorial, you can clean up the resources that you created on Google Cloud so they won't take up quota and you won't be billed for them in the future. The following sections describe how to delete or turn off these resources.
Deleting intermediate files in your Cloud Storage bucket
When you run the pipeline, it stores intermediate files in gs://BUCKET/WORK_DIR. You can remove the files after the workflow completes to reduce Cloud Storage charges.
To view the amount of space used in the directory:
gsutil du -sh gs://BUCKET/WORK_DIR
To remove files from the work directory:
Console
In the Cloud Storage console, open the Storage Browser page.
Go to the BUCKET and browse to the WORK_DIR specified in the nextflow.config file.
Browse through the subfolders and delete any unwanted files or directories. To delete all files, delete the entire WORK_DIR.
gcloud
Open Cloud Shell and run the following:
To remove all of the intermediate files in the WORK_DIR directory:
gsutil -m rm gs://BUCKET/WORK_DIR/**
Deleting the project
The easiest way to eliminate billing is to delete the project that you created for the tutorial.
To delete the project:
- In the Cloud Console, go to the Manage resources page.
- In the project list, select the project that you want to delete, and then click Delete.
- In the dialog, type the project ID, and then click Shut down to delete the project.
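Alternatively, you can delete the project from Cloud Shell; gcloud asks for confirmation before deleting. Replace PROJECT_ID with your project's ID:
# Delete the tutorial project (irreversible once confirmed)
gcloud projects delete PROJECT_ID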
What's next
- The Nextflow site, Nextflow GitHub repository, and Nextflow documentation provide more complete background information, documentation, and support for using Nextflow.