Running a Pipeline with Nextflow

This page explains how to run a pipeline on Google Cloud Platform using the Pipelines API and Nextflow.

The pipeline used in this tutorial is a proof-of-concept RNA-Seq pipeline intended to demonstrate how to use Nextflow on Google Cloud Platform.

Objectives

After completing this tutorial, you'll know how to:

  • Install Nextflow in Cloud Shell
  • Configure a Nextflow pipeline to use Pipelines API
  • Run a pipeline using Nextflow on Google Cloud Platform

Costs

This tutorial uses billable components of Google Cloud Platform, including:

  • Compute Engine
  • Cloud Storage

Use the Pricing Calculator to generate a cost estimate based on your projected usage. New Cloud Platform users might be eligible for a free trial.

Before you begin

  1. Sign in to your Google Account.

    If you don't already have one, sign up for a new account.

  2. Select or create a GCP project.

    GO TO THE MANAGE RESOURCES PAGE

  3. Make sure that billing is enabled for your project.

    LEARN HOW TO ENABLE BILLING

  4. Enable the required Cloud Genomics, Compute Engine, and Cloud Storage APIs.

    ENABLE THE APIS
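
Alternatively, you can enable the same APIs from Cloud Shell with gcloud. This is a minimal sketch, assuming the standard service names apply to your project:

gcloud services enable genomics.googleapis.com compute.googleapis.com storage-component.googleapis.com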

Create a Cloud Storage bucket

Following the guidance outlined in the bucket and object naming guidelines, create a uniquely named bucket to store temporary work and output files throughout this tutorial.

console

  1. In the GCP Console, open the Cloud Storage browser:

    GO TO CLOUD STORAGE BROWSER

  2. Click Create bucket.

  3. In the Bucket name text box, enter the name you selected for BUCKET, and then click Create.

gcloud

  1. Open Cloud Shell:

    GO TO CLOUD SHELL

  2. Create a bucket using the following command:

    gsutil mb gs://BUCKET
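
    To confirm the bucket exists, you can list it with the -b flag (an optional check, not part of the original steps):

    gsutil ls -b gs://BUCKET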
    

Install and configure Nextflow in Cloud Shell

To avoid having to install any software on your machine, run all the terminal commands in this tutorial from Cloud Shell.

  1. Open Cloud Shell.

    GO TO CLOUD SHELL

  2. Install Nextflow in Cloud Shell.

    export NXF_VER=19.01.0
    export NXF_MODE=google
    curl https://get.nextflow.io | bash
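    # Optional check (not part of the original steps): confirm that the
    # expected Nextflow version was installed.
    ./nextflow -version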
    
  3. Clone the sample pipeline repository, which includes both the pipeline to run and the sample data it uses.

    git clone https://github.com/nextflow-io/rnaseq-nf.git
    
  4. Configure Nextflow to use the Pipelines API.

    1. Change to the rnaseq-nf folder.

      cd rnaseq-nf
      

    2. Copy the following text and paste it at the end of the file named nextflow.config, replacing PROJECT_ID and REGION with your own values. REGION is the region in which to run, such as us-central1. If any of these sections already exist in the config file, replace them with the values shown below. (A filled-in example appears after these steps.)

      process {
         executor = 'google-pipelines'
      }
      
      cloud {
         instanceType = 'n1-standard-1'
      }
      
      google {
         project = 'PROJECT_ID'
         region = 'REGION'
      }
      
    3. Change back to the previous folder.

      cd ..
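
For reference, the appended configuration might look like the following once filled in. The values my-project-id and us-central1 are hypothetical; substitute your own project ID and region:

process {
   executor = 'google-pipelines'
}

cloud {
   instanceType = 'n1-standard-1'
}

google {
   project = 'my-project-id'
   region = 'us-central1'
}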
      

Run the pipeline with Nextflow

  1. Run the pipeline with Nextflow, substituting your own BUCKET and WORK_DIR values (an example with hypothetical values follows these steps).

    ./nextflow run rnaseq-nf/main.nf -w gs://BUCKET/WORK_DIR
    
  2. Nextflow continues to run in Cloud Shell until the pipeline completes.
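
For example, with a hypothetical bucket named my-nextflow-bucket and a work directory named scratch, the command would be:

./nextflow run rnaseq-nf/main.nf -w gs://my-nextflow-bucket/scratch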

Viewing the output of the Nextflow pipeline

After the pipeline finishes, you can check the output as well as any logs, errors, commands run, and temporary files.

The final output is saved in your Cloud Storage bucket as the file results/qc_report.html.
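
To view the report, you can copy it out of the bucket with gsutil. This sketch assumes the results folder is written under your bucket; adjust the path to match where your run published it:

gsutil cp gs://BUCKET/results/qc_report.html .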

To check individual output files from each task as well as intermediate files:

console

  1. In the GCP Console, open the Cloud Storage browser:

    GO TO CLOUD STORAGE BROWSER

  2. Go to the BUCKET and WORK_DIR you specified when running the pipeline.

  3. There is a folder for each of the separate tasks that were run in the pipeline.

  4. Each folder contains the commands that were run, output files, and temporary files used during the workflow.

gcloud

  1. To see the output files in Cloud Shell, first open Cloud Shell:

    GO TO CLOUD SHELL

  2. Run the following command to list the outputs in your Cloud Storage bucket.

    gsutil ls gs://BUCKET/WORK_DIR
    
  3. The output shows a folder for each of the tasks that were run. Continue listing the contents of the subdirectories to see all of the files created by the pipeline.

    gsutil ls gs://BUCKET/WORK_DIR/FOLDER/TASK_FOLDER
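
    Alternatively, gsutil can list the entire work directory recursively in a single command (an optional shortcut, not part of the original steps):

    gsutil ls -r gs://BUCKET/WORK_DIR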
    

You can either view the intermediate files created by the pipeline and choose which ones you want to keep, or remove them to reduce costs associated with Cloud Storage. To remove the files, see Deleting intermediate files in your Cloud Storage bucket.

Troubleshooting

  • If you encounter problems when running the pipeline, see Pipelines API troubleshooting.

  • If your pipeline fails, you can check the logs for each task by looking at the log files in each of the task folders in Cloud Storage, such as .command.err, .command.log, and .command.out (see the example after this list).
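
For example, to print a failed task's error log directly in Cloud Shell (FOLDER and TASK_FOLDER stand for the same placeholders used earlier):

gsutil cat gs://BUCKET/WORK_DIR/FOLDER/TASK_FOLDER/.command.err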

Cleaning up

To avoid incurring charges to your Google Cloud Platform account for the resources used in this tutorial:

After you've finished the Running a Pipeline with Nextflow tutorial, you can clean up the resources that you created on GCP so they won't take up quota and you won't be billed for them in the future. The following sections describe how to delete or turn off these resources.

Deleting intermediate files in your Cloud Storage bucket

When you run the pipeline, it stores intermediate files in gs://BUCKET/WORK_DIR. You can remove the files after the workflow completes to reduce Cloud Storage charges.

To view the amount of space used in the directory:

gsutil du -sh gs://BUCKET/WORK_DIR

To remove files from the work directory:

console

  1. In the GCP Console, open the Cloud Storage browser:

    GO TO CLOUD STORAGE BROWSER

  2. Go to the BUCKET and WORK_DIR you specified when running the pipeline.

  3. Browse through the subfolders and delete any unwanted files or directories. To delete all files, delete the entire WORK_DIR.

gcloud

  1. Open Cloud Shell:

    GO TO CLOUD SHELL

  2. To remove all of the intermediate files in the WORK_DIR directory:

    gsutil -m rm gs://BUCKET/WORK_DIR/**
    

Deleting the project

The easiest way to eliminate billing is to delete the project that you created for the tutorial.

To delete the project:

  1. In the GCP Console, go to the Projects page.

    GO TO THE PROJECTS PAGE

  2. In the project list, select the project you want to delete and click Delete.
  3. In the dialog, type the project ID and then click Shut down to delete the project.
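
Alternatively, you can delete the project from Cloud Shell; replace PROJECT_ID with your project's ID:

gcloud projects delete PROJECT_ID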
