Running a GATK Best Practices Pipeline

This page explains how to run a pipeline on Google Cloud Platform using the GATK Best Practices provided by the Broad Institute.

The workflow used in this tutorial is an implementation of the GATK Best Practices for variant discovery in whole genome sequencing (WGS) data. The workflow is written in the Broad Institute's Workflow Definition Language (WDL) and runs on the Cromwell WDL runner.

Objectives

After completing this tutorial, you'll know how to:

  • Run a pipeline using the GATK Best Practices with data from build 38 of the human reference genome
  • Run a pipeline using the GATK Best Practices using your own data

Costs

This tutorial uses billable components of Google Cloud Platform, including:

  • Compute Engine
  • Cloud Storage

Use the Pricing Calculator to generate a cost estimate based on your projected usage. New Cloud Platform users might be eligible for a free trial.

Before you begin

  1. Sign in to your Google Account.

    If you don't already have one, sign up for a new account.

  2. Select or create a GCP project.

    Go to the Manage resources page

  3. Make sure that billing is enabled for your project.

    Learn how to enable billing

  4. Enable the Cloud Genomics, Compute Engine, and Cloud Storage APIs.

    Enable the APIs

  5. Install and initialize the Cloud SDK.
  6. Update and install gcloud components:
    gcloud components update &&
    gcloud components install alpha
  7. Install git to download the required files.

    Download git

  8. By default, Compute Engine has resource quotas in place to prevent inadvertent usage. By increasing quotas, you can launch more virtual machines concurrently, increasing throughput and reducing turnaround time.

    For best results in this tutorial, request additional quota above your project's defaults. The following list provides the recommended quota increases, as well as the minimum quotas needed to run the tutorial. Make your quota requests in the us-central1 region:

    • CPUs: 101 (minimum 17)
    • Persistent Disk Standard (GB): 10,500 (minimum 320)
    • In-use IP Addresses: 51 (minimum 2)

    You can leave other quota request fields empty to keep your current quotas.
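
    To check your project's current limits and usage in the us-central1 region before filing a request, you can run the following optional Compute Engine command:

    gcloud compute regions describe us-central1

    The output includes a quotas list showing the limit and current usage for metrics such as CPUS, DISKS_TOTAL_GB, and IN_USE_ADDRESSES.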

Create a Cloud Storage bucket

Create a Cloud Storage bucket using the gsutil mb command. Due to a requirement in the Cromwell engine, do not use an underscore (_) character in the bucket name or you will encounter an error.

gsutil mb gs://BUCKET

The pipeline will output results, logs, and intermediate files to this bucket.
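
For example, assuming a hypothetical bucket name (note that it contains no underscores), you can create the bucket in the us-central1 region, where this tutorial's Compute Engine resources run:

gsutil mb -l us-central1 gs://my-gatk-bucket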

Download the example files

Download the WDL and helper script:

git clone https://github.com/openwdl/wdl.git
git clone https://github.com/gatk-workflows/broad-prod-wgs-germline-snps-indels.git

The gatk-workflows/broad-prod-wgs-germline-snps-indels repository contains the following files needed to run the pipeline:

  • *.wdl: Workflow definition
  • *.inputs.json: Input parameters, including paths to the BAM files and reference genome
  • *.options.json: Workflow runtime options

You can find the Cromwell pipeline definition file used to run WDL pipelines in the runners/cromwell_on_google/ directory of the openwdl/wdl repository.
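
To confirm that the clones contain the expected files, you can list their contents. The following commands assume you ran the git clone commands in your current working directory:

ls broad-prod-wgs-germline-snps-indels/
ls wdl/runners/cromwell_on_google/wdl_runner/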

Run the pipeline using sample data

The pipeline runs with WGS data using build 38 of the human reference genome. The input files are unaligned BAM files.

To run the pipeline:

  1. Create the environment variable GATK_GOOGLE_DIR which points to the folder containing the Broad pipeline files:

    export GATK_GOOGLE_DIR="${PWD}"/broad-prod-wgs-germline-snps-indels
    

  2. Create the environment variable GATK_OUTPUT_DIR which points to the Cloud Storage bucket and a folder for the output of the workflow, intermediate workspace files, and logging:

    export GATK_OUTPUT_DIR=gs://BUCKET/FOLDER
    

  3. Change directory to the runners/cromwell_on_google/ folder in the openwdl/wdl repository you downloaded. This directory contains the pipeline definition file for running WDL-based pipelines on GCP:

    cd wdl/runners/cromwell_on_google/
    

  4. Run the pipeline:

    gcloud alpha genomics pipelines run \
      --pipeline-file wdl_runner/wdl_pipeline.yaml \
      --zones us-central1-f \
      --memory 5 \
      --logging "${GATK_OUTPUT_DIR}/logging" \
      --inputs-from-file WDL="${GATK_GOOGLE_DIR}/PairedEndSingleSampleWf.gatk4.0.wdl" \
      --inputs-from-file WORKFLOW_INPUTS="${GATK_GOOGLE_DIR}/PairedEndSingleSampleWf.hg38.inputs.json" \
      --inputs-from-file WORKFLOW_OPTIONS="${GATK_GOOGLE_DIR}/PairedEndSingleSampleWf.gatk4.0.options.json" \
      --inputs WORKSPACE="${GATK_OUTPUT_DIR}/workspace" \
      --inputs OUTPUTS="${GATK_OUTPUT_DIR}/outputs"
    

  5. The command returns an operation ID in the format Running [operations/OPERATION_ID]. You can use the operation ID to track the status of the pipeline by running the following command:

    gcloud alpha genomics operations describe OPERATION_ID \
        --format='yaml(done, error, metadata.events)'
    

  6. The operations describe command returns done: true when the pipeline finishes.

    You can run a script included with the wdl_runner to check every 300 seconds whether the job is running, has finished, or returned an error:

    ./monitoring_tools/monitor_wdl_pipeline.sh OPERATION_ID 300
    

  7. After the pipeline finishes, run the following command to list the outputs in your Cloud Storage bucket:

    gsutil ls gs://BUCKET/FOLDER/outputs/
    

You can either view the intermediate files created by the pipeline and choose which ones you want to keep, or remove them to reduce costs associated with Cloud Storage. To remove the files, see Deleting intermediate files in your Cloud Storage bucket.
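
If you want a local copy of the results, you can optionally download the outputs folder with gsutil. This assumes the GATK_OUTPUT_DIR variable from the steps above is still set:

gsutil -m cp -r "${GATK_OUTPUT_DIR}/outputs" .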

Run the GATK Best Practices pipeline on your data

Before you run the pipeline on your local data, you need to copy the data into a Cloud Storage bucket.

Copy input files

The pipeline can run with unaligned BAM files stored in Cloud Storage. If your files are in a different format, such as aligned BAM or FASTQ, you must convert them to unaligned BAM before running the pipeline. You can convert them locally, or you can use the Pipelines API to convert them in the cloud.
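
For example, one way to convert paired-end FASTQ files locally is with Picard's FastqToSam tool, which produces an unaligned BAM. The following command is only a sketch: the file names, sample name, and read group values are placeholders, and picard.jar stands for wherever Picard is installed on your machine:

java -jar picard.jar FastqToSam \
    FASTQ=sample_R1.fastq.gz \
    FASTQ2=sample_R2.fastq.gz \
    OUTPUT=sample.unmapped.bam \
    SAMPLE_NAME=sample1 \
    READ_GROUP_NAME=rg1 \
    LIBRARY_NAME=lib1 \
    PLATFORM=ILLUMINA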

The following example shows how to copy a single file from a local filesystem to a Cloud Storage bucket:

gsutil -m -o 'GSUtil:parallel_composite_upload_threshold=150M' cp FILE \
    gs://BUCKET/FOLDER

For more examples of how to copy files to a Cloud Storage bucket, see the section on Copying data into Cloud Storage.

The gsutil command-line tool automatically verifies checksums, so when the transfer succeeds, you can be confident that your data arrived intact and is ready for use with the GATK Best Practices pipeline.
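
If you want to compare checksums yourself, you can compute the local file's MD5 hash and compare it with the uploaded object's metadata. FILE here stands for one of your input files:

gsutil hash -m FILE
gsutil stat gs://BUCKET/FOLDER/FILE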

Run the pipeline on your data

To run the GATK Best Practices on your own unaligned BAM files, make a copy of PairedEndSingleSampleWf.hg38.inputs.json, then update the paths to point to your files in a Cloud Storage bucket. You can then follow the steps in Run the pipeline using sample data, using the updated PairedEndSingleSampleWf.hg38.inputs.json file.
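
For example, assuming you call the copy my-sample.hg38.inputs.json (a hypothetical file name), you can create and edit it as follows, then pass it as WORKFLOW_INPUTS in the gcloud command:

cp "${GATK_GOOGLE_DIR}/PairedEndSingleSampleWf.hg38.inputs.json" my-sample.hg38.inputs.json
# Edit my-sample.hg38.inputs.json so that the unmapped BAM entries point to
# gs://BUCKET/FOLDER/ paths for your own files.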

If your data isn't made up of unaligned whole genome BAM files, for example if it consists of exome sequencing data, targeted panels, or somatic data, or if it uses a different reference genome, you will have to use different workflows. See the GATK Support Forum and the Broad Institute GitHub repository for more information.

Troubleshooting

  • If you encounter problems when running the pipeline, see Pipelines API troubleshooting.

  • GATK has strict expectations about input file formats. To avoid problems, you can validate your files with ValidateSamFile before running the pipeline, as shown in the example below.
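
    For example, the following is a minimal local validation run with Picard; picard.jar and input.bam are placeholders for your Picard installation and your input file:

    java -jar picard.jar ValidateSamFile \
        I=input.bam \
        MODE=SUMMARY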

  • If your GATK run fails, you can check the logs by running the following command:

    gsutil ls gs://BUCKET/FOLDER/logs
    

  • If you encounter permission errors, check that your service account has read access to the input files and write access to the output bucket path. If you're writing output files to a bucket in a GCP project that isn't your own, you'll need to grant the service account permission to access the bucket.

Cleaning up

To avoid incurring charges to your Google Cloud Platform account for the resources used in this tutorial, clean up the resources you created after you've finished so you won't be billed for them in the future. The following sections describe how to delete or turn off these resources.

Deleting intermediate files in your Cloud Storage bucket

When you run the pipeline, it stores intermediate files in gs://BUCKET/FOLDER/workspace. You can remove the files after the workflow completes to reduce Cloud Storage charges.

To view the amount of space used in the workspace directory:

gsutil du -sh gs://BUCKET/FOLDER/workspace
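
To preview the first few intermediate objects before deciding what to remove, you can optionally run:

gsutil ls gs://BUCKET/FOLDER/workspace/** | head -n 20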

To remove all of the intermediate files in the workspace directory:

gsutil -m rm gs://BUCKET/FOLDER/workspace/**

Deleting the project

The easiest way to eliminate billing is to delete the project you created for the tutorial.

To delete the project:

  1. In the GCP Console, go to the Projects page.

    Go to the Projects page

  2. In the project list, select the checkbox next to the project you want to delete, and then click Delete project.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.

What's next

  • The Broad Institute GATK site and forums provide more complete background information, documentation, and support for the GATK tools and WDL.