Running Pipelines with gcloud

Want to process raw data into analysis-ready output? Convert formats for 1000 files? Run a command-line quality-control utility on files in a bucket? You need the Pipelines API!

The Pipelines API is an easy way to create, run, and monitor jobs that execute command-line tools in a Docker container on a Google Compute Engine VM. Input files are copied from a Cloud Storage bucket to a local disk, and output files are copied back to Cloud Storage. Compute Engine instances start up quickly, and you pay only for the minutes you use (with a 10-minute minimum).

Get ready

Prerequisites

You'll need a Google Cloud project and the Cloud SDK, which provides the gcloud and gsutil command-line tools used on this page. If you're going to create your own Docker images, you'll also need to install Docker.
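
As a quick check (assuming the Cloud SDK is already installed), you can make sure the command-line tools are up to date and, if you plan to build your own images, that Docker is available:

gcloud components update
docker --version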

Create a bucket for your pipeline results

To run a custom pipeline that generates output files, you'll need write access to a bucket where output results will be copied. To create a bucket, use the storage browser or run the command-line utility gsutil, included in the Cloud SDK.

gsutil mb -c MULTI_REGIONAL gs://my-bucket

Change my-bucket to a unique name that follows the bucket-naming conventions. (Fine points: The MULTI_REGIONAL setting gives you a Multi-Regional Storage bucket. By default, the bucket will be in the US, but you can change or refine the location setting with the -l option.)
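
For example, to create the bucket in the EU multi-region instead of the default US location, you could pass the -l option like this:

gsutil mb -c MULTI_REGIONAL -l EU gs://my-bucket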

Select or transfer input files

To run custom pipelines, you’ll need read access to input files in a Cloud Storage bucket. You can transfer your own files with the storage browser or gsutil, or you can run on data that’s public or shared with you. To keep it simple, the example on this page uses public data.
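
If you do want to stage your own files, a gsutil copy is all it takes; my-reads.bam below is just a hypothetical local file name. You can also browse the public data used in this example:

# Copy a local BAM file into your bucket (my-reads.bam is a placeholder).
gsutil cp my-reads.bam gs://my-bucket/my-path/

# List some of the public 1000 Genomes BAM files.
gsutil ls gs://genomics-public-data/1000-genomes/bam/ | head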

Create a pipeline

Choose a Docker image

The Pipelines API runs tools packaged as Docker images. You can reference any Docker image stored in Google Container Registry, Docker Hub, or another public Docker repository. And, of course, you can create your own Docker image if you don't find one ready-made.
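
Before wiring an image into a pipeline, it can help to pull it locally (Docker required) and confirm that the tool you need is on the image's PATH. For example, with the samtools image used below:

docker pull quay.io/cancercollaboratory/dockstore-tool-samtools-index

# Running samtools with no arguments prints its usage and version,
# confirming the binary is available inside the image.
docker run --rm quay.io/cancercollaboratory/dockstore-tool-samtools-index samtools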

Define the pipeline

Defining a pipeline to run from the command line is meant to be easy: you describe the pipeline job in a YAML or JSON file. As an example, let's run a tool called samtools, which builds an index for a large binary file of DNA sequences (a BAM file). Using your favorite text editor, copy and paste this definition and save it as samtools_index.yaml.

name: samtools_index
description: Run samtools index to generate a BAM index file
inputParameters:
- name: INPUT_FILE
  localCopy:
    disk: datadisk
    path: input.bam
outputParameters:
- name: OUTPUT_FILE
  localCopy:
    disk: datadisk
    path: input.bam.bai
resources:
  minimumCpuCores: 1
  minimumRamGb: 1
  zones:
  - us-central1-a
  - us-central1-b
  - us-central1-c
  - us-central1-f
  disks:
  - name: datadisk
    type: PERSISTENT_HDD
    sizeGb: 100
    mountPoint: /mnt/data
docker:
  imageName: quay.io/cancercollaboratory/dockstore-tool-samtools-index
  cmd: "samtools index ${INPUT_FILE}"

This is intended to be a simple example, but it includes many of the optional fields so you can see how they work.

Run a pipeline

Start a pipeline run

Let's run the example pipeline we created above on a file from the 1000 Genomes Project:

gcloud alpha genomics pipelines run \
  --pipeline-file samtools_index.yaml \
  --logging gs://my-bucket/my-path/logs \
  --inputs INPUT_FILE=gs://genomics-public-data/1000-genomes/bam/HG00114.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam \
  --outputs OUTPUT_FILE=gs://my-bucket/my-path/HG00114.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam.bai

You can replace samtools_index.yaml with another pipeline definition and change the INPUT_FILE and OUTPUT_FILE paths (as before, replace my-bucket with the name of your bucket). The output is:

Running: operations/operation-id

You can use the operation-id to check the job status or cancel the job.

gcloud alpha genomics operations describe operation-id
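
For a long-running job, you can narrow the describe output to just the fields you care about, or cancel the operation if you no longer need it (command availability may vary with your Cloud SDK version):

# Show only completion status, any error, and the event log.
gcloud alpha genomics operations describe operation-id \
  --format='yaml(done, error, metadata.events)'

# Cancel a running job.
gcloud alpha genomics operations cancel operation-id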

When the operation completes, you should see that your output file has been copied to my-path:

gsutil ls gs://my-bucket/my-path/
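
The run command above also wrote the tool's logs under the path given to --logging, which is usually the first place to look if something goes wrong:

gsutil ls gs://my-bucket/my-path/logs/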

That’s it! You have now run a pipeline. You can adapt this to run your own tools or tools already packaged as Docker images.
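
If you package your own tool, one typical flow is to build the image locally and push it to Google Container Registry so the Pipelines API can pull it; my-project and my-tool below are placeholders:

# Build the image from a Dockerfile in the current directory.
docker build -t gcr.io/my-project/my-tool .

# Push it to Container Registry (Docker must be authenticated to gcr.io,
# for example with `gcloud auth configure-docker`).
docker push gcr.io/my-project/my-tool

Then reference the pushed image from the imageName field of your pipeline definition.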

What's next

Learn more: