Running Nextflow

This page explains how to run a pipeline on Google Cloud using Nextflow.

The pipeline used in this tutorial is a proof-of-concept RNA-Seq pipeline that demonstrates how to use Nextflow on Google Cloud.

Objectives

After completing this tutorial, you'll know how to:

  • Install Nextflow in Cloud Shell
  • Configure a Nextflow pipeline
  • Run a pipeline using Nextflow on Google Cloud

Costs

This tutorial uses billable components of Google Cloud, including:

  • Compute Engine
  • Cloud Storage

Use the Pricing Calculator to generate a cost estimate based on your projected usage. New Cloud Platform users might be eligible for a free trial.

Before you begin

  1. Sign in to your Google Account.

    If you don't already have one, sign up for a new account.

  2. In the Cloud Console, on the project selector page, select or create a Cloud project.

    Go to the project selector page

  3. Make sure that billing is enabled for your Google Cloud project. Learn how to confirm billing is enabled for your project.

  4. Enable the Cloud Life Sciences, Compute Engine, and Cloud Storage APIs.

    Enable the APIs

Create a Cloud Storage bucket

Following the bucket naming guidelines, create a uniquely named bucket to store temporary work and output files throughout this tutorial. For DNS compatibility, this tutorial will not work with bucket names that contain an underscore (_).

Console

  1. In the Cloud Console, open the Cloud Storage browser:

    Go to Cloud Storage browser

  2. Click Create bucket.

  3. In the Bucket name text box, enter a unique name for your bucket, and then click Create.

gcloud

  1. Open Cloud Shell:

    Go to Cloud Shell

  2. Run the following command to create a bucket, replacing BUCKET with a unique name for your bucket.

    gsutil mb gs://BUCKET
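    
    If you want to choose where the bucket's data is stored, gsutil mb also accepts a -l location flag. For example (the bucket name and location below are placeholders only):

    gsutil mb -l us-central1 gs://my-nextflow-tutorial-bucket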
    

Create and activate a service account

Console

Create a service account using Cloud Console:

  1. In the Cloud Console, go to the Service Accounts page.

    Go to Service Accounts page

  2. From the Service account list, select New service account.

  3. In the Service account name field, enter nextflow-service-account.

  4. From the Role list, select the following roles:

    • Cloud Life Sciences Workflow Runner
    • Service Account User
    • Service Usage Consumer
    • Storage Object Admin
  5. Click Create. A JSON file that contains your key downloads to your computer.

You can provide authentication credentials to your application code or commands by setting the environment variable GOOGLE_APPLICATION_CREDENTIALS to the file path of the JSON file that contains your service account key.

The following steps show how to set the GOOGLE_APPLICATION_CREDENTIALS environment variable.

  1. Open Cloud Shell.

    Go to Cloud Shell

  2. In Cloud Shell, open the More menu (the three-dot menu), select Upload file, and then select the JSON key file that you just created. This uploads the file to the home directory of your Cloud Shell instance.

  3. Confirm that the uploaded file is in your home directory and note its file name by running the following command:

    ls
    

  4. Set the credentials, replacing KEY-FILENAME.json with the name of your key file.

    export GOOGLE_APPLICATION_CREDENTIALS=${PWD}/KEY-FILENAME.json
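    
    As a quick check, you can confirm that the variable points at the uploaded key and that the key belongs to the expected service account (service account key files contain a client_email field):

    echo ${GOOGLE_APPLICATION_CREDENTIALS}
    grep client_email ${GOOGLE_APPLICATION_CREDENTIALS}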
    

gcloud

Create a service account using Cloud Shell:

  1. Open Cloud Shell.

    Go to Cloud Shell

  2. Set the variables to be used in creating the service account, replacing PROJECT_ID with your project ID.

    export PROJECT=PROJECT_ID
    export SERVICE_ACCOUNT_NAME=nextflow-service-account
    export SERVICE_ACCOUNT_ADDRESS=${SERVICE_ACCOUNT_NAME}@${PROJECT}.iam.gserviceaccount.com
    
  3. Create the service account.

    gcloud iam service-accounts create ${SERVICE_ACCOUNT_NAME}
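    
    Optionally, confirm that the account exists before granting roles:

    gcloud iam service-accounts describe ${SERVICE_ACCOUNT_ADDRESS}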
    
  4. The service account needs the following Cloud Identity and Access Management roles:

    • roles/lifesciences.workflowsRunner
    • roles/iam.serviceAccountUser
    • roles/serviceusage.serviceUsageConsumer
    • roles/storage.objectAdmin

    Grant these roles by running the following commands in Cloud Shell:

    gcloud projects add-iam-policy-binding ${PROJECT} \
        --member serviceAccount:${SERVICE_ACCOUNT_ADDRESS} \
        --role roles/lifesciences.workflowsRunner
    
    gcloud projects add-iam-policy-binding ${PROJECT} \
        --member serviceAccount:${SERVICE_ACCOUNT_ADDRESS} \
        --role roles/iam.serviceAccountUser
    
    gcloud projects add-iam-policy-binding ${PROJECT} \
        --member serviceAccount:${SERVICE_ACCOUNT_ADDRESS} \
        --role roles/serviceusage.serviceUsageConsumer
    
    gcloud projects add-iam-policy-binding ${PROJECT} \
        --member serviceAccount:${SERVICE_ACCOUNT_ADDRESS} \
        --role roles/storage.objectAdmin
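    
    To confirm the bindings took effect, one way is to list the roles granted to the service account using gcloud's generic --flatten, --filter, and --format flags, for example:

    gcloud projects get-iam-policy ${PROJECT} \
        --flatten="bindings[].members" \
        --filter="bindings.members:serviceAccount:${SERVICE_ACCOUNT_ADDRESS}" \
        --format="table(bindings.role)"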
    

Set the GOOGLE_APPLICATION_CREDENTIALS environment variable to the service account key using Cloud Shell

  1. Obtain the service account credentials, and set the GOOGLE_APPLICATION_CREDENTIALS environment variable to the path of the private key file:

    export SERVICE_ACCOUNT_KEY=${SERVICE_ACCOUNT_NAME}-private-key.json
    gcloud iam service-accounts keys create \
      --iam-account=${SERVICE_ACCOUNT_ADDRESS} \
      --key-file-type=json ${SERVICE_ACCOUNT_KEY}
    export SERVICE_ACCOUNT_KEY_FILE=${PWD}/${SERVICE_ACCOUNT_KEY}
    export GOOGLE_APPLICATION_CREDENTIALS=${PWD}/${SERVICE_ACCOUNT_KEY}
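    
    You can verify that the key file was written and that the environment variable points to it:

    ls -l ${SERVICE_ACCOUNT_KEY}
    echo ${GOOGLE_APPLICATION_CREDENTIALS}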
    

Install and configure Nextflow in Cloud Shell

To avoid having to install any software on your machine, continue running all the terminal commands in this tutorial from Cloud Shell.

  1. If not already open, open Cloud Shell.

    Go to Cloud Shell

  2. Install Nextflow in Cloud Shell.

    export NXF_VER=20.01.0
    export NXF_MODE=google
    curl https://get.nextflow.io | bash
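    
    To confirm that the download succeeded and that the expected version is in use, you can run Nextflow's info command:

    ./nextflow info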
    
  3. Clone the sample pipeline repository, which includes the pipeline to run and the sample data that it uses.

    git clone https://github.com/nextflow-io/rnaseq-nf.git
    
  4. Configure Nextflow:

    1. Change to the rnaseq-nf folder.

      cd rnaseq-nf
      

    2. Using a text editor of your choice, edit the file named nextflow.config and make the following updates to the section labeled gls:

      • Add the line google.project if it is not present.
      • Replace PROJECT_ID with your project ID.
      • Add the line google.location which specifies where the Cloud Life Sciences request will be processed.
      • If desired, change the value of google.location. It must be one of the currently available Cloud Life Sciences API locations.
      • If desired, change the value of google.region which specifies the region in which the Compute Engine VMs will be launched. See available Compute Engine Regions and Zones.
      • Replace BUCKET with the bucket name created above.
      • Replace WORK_DIR with the name of a folder to use for logging and output. This should be a new directory name which does not yet exist in your bucket.
      • Note: the workDir variable location must contain at least 1 subdirectory. Do not use just the bucket name.
      gls {
         params.transcriptome = 'gs://rnaseq-nf/data/ggal/transcript.fa'
         params.reads = 'gs://rnaseq-nf/data/ggal/gut_{1,2}.fq'
         params.multiqc = 'gs://rnaseq-nf/multiqc'
         process.executor = 'google-lifesciences'
         process.container = 'nextflow/rnaseq-nf:latest'
         workDir = 'gs://BUCKET/WORK_DIR'
         google.location = 'europe-west2'
         google.region  = 'europe-west1'
         google.project = 'PROJECT_ID'
      }
      
    3. Change back to the previous folder

      cd ..
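      
    4. Optionally, print the configuration that Nextflow resolves for the gls profile and confirm that your project ID, bucket, and work directory appear as expected. This check assumes you are back in the directory that contains the nextflow launcher:

      ./nextflow config rnaseq-nf -profile gls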
      

Run the pipeline with Nextflow

  1. Run the pipeline with Nextflow in Cloud Shell:

    ./nextflow run rnaseq-nf/main.nf -profile gls
    
  2. Nextflow continues to run in Cloud Shell until the pipeline completes.
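
  3. If your Cloud Shell session disconnects before the run finishes, you can relaunch the pipeline and reuse the results of tasks that already completed by adding Nextflow's -resume flag:

    ./nextflow run rnaseq-nf/main.nf -profile gls -resume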

Viewing the output of the Nextflow pipeline

After the pipeline finishes, you can check the output as well as any logs, errors, commands run, and temporary files.

The final output file is saved in your Cloud Storage bucket as the file results/qc_report.html.

To check individual output files from each task as well as intermediate files:

Console

  1. In the Cloud Storage console, open the Storage Browser page:

    Go to Cloud Storage browser

  2. Go to the BUCKET and browse to the WORK_DIR specified in the nextflow.config file.

  3. There is a folder for each of the separate tasks that were run in the pipeline.

  4. Each folder contains the commands that were run, the output files, and the temporary files used during the workflow.

gcloud

  1. To see the output files in Cloud Shell, first open Cloud Shell:

    Go to Cloud Shell

  2. Run the following command to list the output in your Cloud Storage bucket, replacing BUCKET and WORK_DIR with the values specified in the nextflow.config file.

    gsutil ls gs://BUCKET/WORK_DIR
    
  3. The output shows a folder for each task that ran. Continue listing the contents of subsequent subdirectories to see all of the files created by the pipeline. Replace FOLDER and TASK_FOLDER with the directory names listed by the previous command.

    gsutil ls gs://BUCKET/WORK_DIR/FOLDER/TASK_FOLDER
    

You can either view the intermediate files created by the pipeline and choose which ones you want to keep, or remove them to reduce costs associated with Cloud Storage. To remove the files, see Deleting intermediate files in your Cloud Storage bucket.

Troubleshooting

  • If you encounter problems when running the pipeline, see Cloud Life Sciences API troubleshooting.

  • If your pipeline fails, you can check the logs for each task by looking at the log files in each of the folders in Cloud Storage, such as .command.err, .command.log, .command.out, and so forth.
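    
    For example, you can print a task's error log directly from Cloud Shell. The FOLDER and TASK_FOLDER placeholders below stand in for the hash-named directories created under your work directory:

    gsutil cat gs://BUCKET/WORK_DIR/FOLDER/TASK_FOLDER/.command.err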

Cleaning up

To avoid incurring charges to your Google Cloud Platform account for the resources used in this tutorial:

After you've finished this tutorial, you can clean up the resources that you created on Google Cloud so they won't take up quota and you won't be billed for them in the future. The following sections describe how to delete or turn off these resources.

Deleting intermediate files in your Cloud Storage bucket

When you run the pipeline, it stores intermediate files in gs://BUCKET/WORK_DIR. You can remove the files after the workflow completes to reduce Cloud Storage charges.

To view the amount of space used in the directory:

gsutil du -sh gs://BUCKET/WORK_DIR

To remove files from the work directory:

Console

  1. In the Cloud Storage console, open the Storage Browser page:

    Go to Cloud Storage browser

  2. Go to the BUCKET and browse to the WORK_DIR specified in the nextflow.config file.

  3. Browse through the subfolders and delete any unwanted files or directories. To delete all files, delete the entire WORK_DIR.

gcloud

  1. Open Cloud Shell:

    Go to Cloud Shell

  2. To remove all of the intermediate files in the WORK_DIR directory:

    gsutil -m rm gs://BUCKET/WORK_DIR/**
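    
  3. If you no longer need the bucket itself, you can also remove it after it has been emptied; note that gsutil rb only removes empty buckets:

    gsutil rb gs://BUCKET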
    

Deleting the project

The easiest way to eliminate billing is to delete the project that you created for the tutorial.

To delete the project:

  1. In the Cloud Console, go to the Manage resources page.

    Go to the Manage resources page

  2. In the project list, select the project that you want to delete and then click Delete.
  3. In the dialog, type the project ID and then click Shut down to delete the project.

What's next