Run Nextflow


This page explains how to run a Nextflow pipeline on Google Cloud.

The pipeline used in this tutorial is a proof-of-concept RNA-Seq pipeline that demonstrates how to use Nextflow on Google Cloud.

Objectives

After completing this tutorial, you'll know how to do the following:

  • Install Nextflow in Cloud Shell.
  • Configure a Nextflow pipeline.
  • Run a pipeline using Nextflow on Google Cloud.

Costs

In this document, you use the following billable components of Google Cloud:

  • Compute Engine
  • Cloud Storage

To generate a cost estimate based on your projected usage, use the pricing calculator. New Google Cloud users might be eligible for a free trial.

Before you begin

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  3. Make sure that billing is enabled for your Google Cloud project.

  4. Enable the Cloud Life Sciences, Compute Engine, and Cloud Storage APIs.

    Enable the APIs

Create a Cloud Storage bucket

Create a uniquely named bucket by following the bucket naming guidelines. The bucket stores temporary work files and output files throughout this tutorial. For DNS compatibility, this tutorial does not work with bucket names that contain an underscore (_).

Console

  1. In the Google Cloud console, go to the Cloud Storage Browser page:

    Go to Browser

  2. Click Create bucket.

  3. On the Create a bucket page, enter your bucket information.

  4. Click Create.

gsutil

  1. Open Cloud Shell:

    Go to Cloud Shell

  2. Use the gsutil mb command:

    gsutil mb gs://BUCKET_NAME
    

    Replace BUCKET_NAME with the name you want to give your bucket, subject to naming requirements. For example, my-bucket.

    If the request is successful, the command returns the following message:

    Creating gs://BUCKET_NAME/...
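
    As an optional check, you can confirm that the bucket exists by listing it with gsutil ls. The -b flag prints the bucket URL rather than its contents:

    gsutil ls -b gs://BUCKET_NAME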
    

Create a service account and add roles

Complete the following steps to create a service account and grant it these Identity and Access Management (IAM) roles:

  • Cloud Life Sciences Workflows Runner
  • Service Account User
  • Service Usage Consumer
  • Storage Object Admin

Console

Create a service account using Google Cloud console:

  1. In the Google Cloud console, go to the Service Accounts page.

    Go to Service Accounts page

  2. Click Create service account.

  3. In the Service account name field, enter nextflow-service-account, and then click Create.

  4. In the Grant this service account access to project section, add the following roles from the Select a role drop-down list:

    • Cloud Life Sciences Workflows Runner
    • Service Account User
    • Service Usage Consumer
    • Storage Object Admin
  5. Click Continue, and then click Done.

  6. In the Service Accounts page, find the service account you created. In the service account's row, click More (⋮), and then click Manage keys.

  7. On the Keys page, click Add key, and then click Create new key.

  8. Select JSON for the Key type and click Create.

    A JSON file that contains your key downloads to your computer.

gcloud

Complete the following steps using Cloud Shell:

  1. Open Cloud Shell.

    Go to Cloud Shell

  2. Set the variables to be used in creating the service account, replacing PROJECT_ID with your project ID.

    export PROJECT=PROJECT_ID
    export SERVICE_ACCOUNT_NAME=nextflow-service-account
    export SERVICE_ACCOUNT_ADDRESS=${SERVICE_ACCOUNT_NAME}@${PROJECT}.iam.gserviceaccount.com
    
  3. Create the service account.

    gcloud iam service-accounts create ${SERVICE_ACCOUNT_NAME}
    
  4. The service account needs the following IAM roles:

    • roles/lifesciences.workflowsRunner
    • roles/iam.serviceAccountUser
    • roles/serviceusage.serviceUsageConsumer
    • roles/storage.objectAdmin

    Grant these roles by running the following commands in Cloud Shell:

    gcloud projects add-iam-policy-binding ${PROJECT} \
        --member serviceAccount:${SERVICE_ACCOUNT_ADDRESS} \
        --role roles/lifesciences.workflowsRunner
    
    gcloud projects add-iam-policy-binding ${PROJECT} \
        --member serviceAccount:${SERVICE_ACCOUNT_ADDRESS} \
        --role roles/iam.serviceAccountUser
    
    gcloud projects add-iam-policy-binding ${PROJECT} \
        --member serviceAccount:${SERVICE_ACCOUNT_ADDRESS} \
        --role roles/serviceusage.serviceUsageConsumer
    
    gcloud projects add-iam-policy-binding ${PROJECT} \
        --member serviceAccount:${SERVICE_ACCOUNT_ADDRESS} \
        --role roles/storage.objectAdmin
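
    To verify that the roles were granted, you can list the service account's bindings. This optional check filters the project's IAM policy down to the roles bound to the service account; the output should include the four roles above:

    gcloud projects get-iam-policy ${PROJECT} \
        --flatten="bindings[].members" \
        --filter="bindings.members:serviceAccount:${SERVICE_ACCOUNT_ADDRESS}" \
        --format="value(bindings.role)"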
    

Provide credentials to your application

You can provide authentication credentials to your application code or commands by setting the environment variable GOOGLE_APPLICATION_CREDENTIALS to the path of the JSON file that contains your service account key.

The following steps show how to set the GOOGLE_APPLICATION_CREDENTIALS environment variable:

Console

  1. Open Cloud Shell.

    Go to Cloud Shell

  2. From the Cloud Shell More menu, select Upload file, and select the JSON key file you created. The file is uploaded to the home directory of your Cloud Shell instance.

  3. Confirm that the uploaded file is in your current directory and confirm the filename by running the following command:

    ls
    

  4. Set the credentials, replacing KEY_FILENAME.json with the name of your key file.

    export GOOGLE_APPLICATION_CREDENTIALS=${PWD}/KEY_FILENAME.json
    

gcloud

Complete the following steps using Cloud Shell:

  1. Open Cloud Shell.

    Go to Cloud Shell

  2. Create a service account key and set the GOOGLE_APPLICATION_CREDENTIALS environment variable to the path of the key file:

    export SERVICE_ACCOUNT_KEY=${SERVICE_ACCOUNT_NAME}-private-key.json
    gcloud iam service-accounts keys create \
      --iam-account=${SERVICE_ACCOUNT_ADDRESS} \
      --key-file-type=json ${SERVICE_ACCOUNT_KEY}
    export SERVICE_ACCOUNT_KEY_FILE=${PWD}/${SERVICE_ACCOUNT_KEY}
    export GOOGLE_APPLICATION_CREDENTIALS=${PWD}/${SERVICE_ACCOUNT_KEY}
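
    Optionally, confirm that the key was created by listing the keys attached to the service account. The output should include the new key's ID alongside any system-managed keys:

    gcloud iam service-accounts keys list \
        --iam-account=${SERVICE_ACCOUNT_ADDRESS}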
    

Install and configure Nextflow in Cloud Shell

To avoid having to install any software on your machine, continue running all the terminal commands in this tutorial from Cloud Shell.

  1. If it isn't already open, open Cloud Shell.

    Go to Cloud Shell

  2. Install Nextflow by running the following commands:

    export NXF_VER=21.10.0
    export NXF_MODE=google
    curl https://get.nextflow.io | bash
    

    If the installation completes successfully, the following message displays:

        N E X T F L O W
    version 21.10.0 build 5430
    created 01-11-2021 15:14 UTC (10:14 EDT)
    cite doi:10.1038/nbt.3820
    http://nextflow.io

    Nextflow installation completed. Please note:
    - the executable file `nextflow` has been created in the folder: DIRECTORY
    - you may complete the installation by moving it to a directory in your $PATH
    
  3. Run the following command to clone the sample pipeline repository. The repository includes the pipeline to run and the sample data that the pipeline uses.

    git clone https://github.com/nextflow-io/rnaseq-nf.git
    
  4. Complete the following steps to configure Nextflow:

    1. Change to the rnaseq-nf folder.

      cd rnaseq-nf
      git checkout v2.0
      

    2. Using a text editor of your choice, edit the file named nextflow.config and make the following updates to the section labeled gls:

      • Add the line google.project if it is not present.
      • Replace PROJECT_ID with your project ID.
      • If desired, change the value of google.location. It must be one of the currently available Cloud Life Sciences API locations.
      • If desired, change the value of google.region which specifies the region in which the Compute Engine VMs launch. See available Compute Engine Regions and Zones.
      • Replace BUCKET with the bucket name created previously.
      • Replace WORK_DIR with the name of a folder to use for logging and output. Use a new directory name that does not yet exist in your bucket.

      After you make these updates, the gls section looks similar to the following:

      gls {
         params.transcriptome = 'gs://rnaseq-nf/data/ggal/transcript.fa'
         params.reads = 'gs://rnaseq-nf/data/ggal/gut_{1,2}.fq'
         params.multiqc = 'gs://rnaseq-nf/multiqc'
         process.executor = 'google-lifesciences'
         process.container = 'nextflow/rnaseq-nf:latest'
         workDir = 'gs://BUCKET/WORK_DIR'
         google.location = 'europe-west2'
         google.region  = 'europe-west1'
         google.project = 'PROJECT_ID'
      }
      
    3. Change back to the previous folder.

      cd ..
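
Optionally, before launching the pipeline, you can ask Nextflow to print the configuration it resolves for the gls profile so you can confirm your edits. This is a quick check, assuming the nextflow executable is in your current directory:

./nextflow config -profile gls rnaseq-nf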
      

Run the pipeline with Nextflow

Run the pipeline by using the following command. After you start the pipeline, it continues to run in the background until completion. It might take up to 10 minutes for the pipeline to finish.

./nextflow run rnaseq-nf/main.nf -profile gls

After the pipeline finishes, the following message displays:

N E X T F L O W  ~  version 21.10.0
Launching `rnaseq-nf/main.nf` [suspicious_mestorf] - revision: ef908c0bfd
R N A S E Q - N F   P I P E L I N E
 ===================================
 transcriptome: gs://rnaseq-nf/data/ggal/transcript.fa
 reads        : gs://rnaseq-nf/data/ggal/gut_{1,2}.fq
 outdir       : results
executor >  google-lifesciences (4)
[db/2af640] process > RNASEQ:INDEX (transcript)     [100%] 1 of 1 ✔
[a6/927725] process > RNASEQ:FASTQC (FASTQC on gut) [100%] 1 of 1 ✔
[59/438177] process > RNASEQ:QUANT (gut)            [100%] 1 of 1 ✔
[9a/9743b9] process > MULTIQC                       [100%] 1 of 1 ✔
Done! Open the following report in your browser --> results/multiqc_report.html
Completed at: DATE TIME
Duration    : 10m
CPU hours   : 0.2
Succeeded   : 4
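
If the run is interrupted, for example because your Cloud Shell session disconnects, you can relaunch it with Nextflow's -resume flag. Nextflow caches task results in the workDir you configured, so only the tasks that did not complete are rerun:

./nextflow run rnaseq-nf/main.nf -profile gls -resume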

View the output of the Nextflow pipeline

After the pipeline finishes, you can check the output and any logs, errors, commands run, and temporary files.

The pipeline saves the final output file, results/multiqc_report.html, to the Cloud Storage bucket you specified in the nextflow.config file.

To check individual output files from each task and intermediate files, complete the following steps:

Console

  1. In the Cloud Storage console, open the Storage Browser page:

    Go to Cloud Storage browser

  2. Go to the BUCKET and browse to the WORK_DIR specified in the nextflow.config file.

  3. Browse the folders; there is one for each separate task that ran in the pipeline.

  4. Each folder contains the commands that were run, the output files, and the temporary files used during the workflow.

gcloud

  1. To see the output files in Cloud Shell, first open Cloud Shell:

    Go to Cloud Shell

  2. Run the following command to list the outputs in your Cloud Storage bucket. Replace BUCKET and WORK_DIR with the values specified in the nextflow.config file.

    gsutil ls gs://BUCKET/WORK_DIR
    
  3. The output shows a folder for each of the tasks that ran. Continue to list the contents of subsequent subdirectories to see all the files created by the pipeline. Replace FOLDER and TASK_FOLDER with the folder names listed by the previous command.

    gsutil ls gs://BUCKET/WORK_DIR/FOLDER/TASK_FOLDER
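
    To print one of a task's files directly in Cloud Shell, such as its log, you can use the gsutil cat command with the same placeholder substitutions:

    gsutil cat gs://BUCKET/WORK_DIR/FOLDER/TASK_FOLDER/.command.log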
    

You can view the intermediate files that the pipeline created and choose which ones to keep, or remove them to reduce the costs associated with Cloud Storage. To remove the files, see Delete intermediate files in your Cloud Storage bucket.

Troubleshooting

  • If you encounter problems when running the pipeline, see Cloud Life Sciences API troubleshooting.

  • If your pipeline fails, you can check the logs for each task by looking at the log files in each of the folders in Cloud Storage, such as .command.err, .command.log, and .command.out.

Clean up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.

After you finish the tutorial, you can clean up the resources that you created so that they stop using quota and incurring charges. The following sections describe how to delete or turn off these resources.

Delete intermediate files in your Cloud Storage bucket

When you run the pipeline, it stores intermediate files in gs://BUCKET/WORK_DIR. You can remove the files after the workflow completes to reduce Cloud Storage charges.

To view the amount of space used in the directory, run the following command:

gsutil du -sh gs://BUCKET/WORK_DIR

To remove files from WORK_DIR, complete the following steps:

Console

  1. In the Cloud Storage console, open the Storage Browser page:

    Go to Cloud Storage browser

  2. Go to the BUCKET and browse to the WORK_DIR specified in the nextflow.config file.

  3. Browse through the subfolders and delete any unwanted files or directories. To delete all files, delete the entire WORK_DIR.

gcloud

  1. Open Cloud Shell:

    Go to Cloud Shell

  2. To remove the intermediate files in the WORK_DIR directory, run the following command:

    gsutil -m rm gs://BUCKET/WORK_DIR/**
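
    The ** wildcard matches every object under WORK_DIR. Alternatively, you can remove the directory recursively:

    gsutil -m rm -r gs://BUCKET/WORK_DIR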
    

Delete the project

The easiest way to eliminate billing is to delete the project that you created for the tutorial.

To delete the project:

  1. In the Google Cloud console, go to the Manage resources page.

    Go to Manage resources

  2. In the project list, select the project that you want to delete, and then click Delete.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.

What's next

The following pages provide more background information, documentation, and support for using Nextflow: