This page explains how to run a pipeline on Google Cloud using the GATK Best Practices provided by the Broad Institute.
The workflow used in this tutorial is an implementation of the GATK Best Practices for variant discovery in whole genome sequencing (WGS) data. The workflow is written in the Broad Institute's Workflow Definition Language (WDL) and runs on the Cromwell WDL runner.
Objectives
After completing this tutorial, you'll know how to:
- Run a pipeline using the GATK Best Practices with data from build 38 of the human reference genome
- Run a pipeline using the GATK Best Practices on your own data
Costs
This tutorial uses billable components of Google Cloud, including:
- Compute Engine
- Cloud Storage
Use the Pricing Calculator to generate a cost estimate based on your projected usage. New Cloud Platform users might be eligible for a free trial.
Before you begin
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
- In the Google Cloud Console, on the project selector page, select or create a Google Cloud project.
- Make sure that billing is enabled for your Cloud project. Learn how to confirm that billing is enabled for your project.
- Enable the Cloud Life Sciences, Compute Engine, and Cloud Storage APIs.
- Install and initialize the Cloud SDK.
- Update and install gcloud components:
gcloud components update &&
gcloud components install beta
- Install git to download the required files.
- By default, Compute Engine has resource quotas in place to prevent inadvertent usage. By increasing quotas, you can launch more virtual machines concurrently, increasing throughput and reducing turnaround time.
For best results in this tutorial, you should request additional quota above your project's default. Recommendations for quota increases are provided in the following list, as well as the minimum quotas needed to run the tutorial. Make your quota requests in the us-central1 region:
  - CPUs: 101 (minimum 17)
  - Persistent Disk Standard (GB): 10,500 (minimum 320)
  - In-use IP Addresses: 51 (minimum 2)
You can leave other quota request fields empty to keep your current quotas.
Create a Cloud Storage bucket
Create a Cloud Storage bucket using the gsutil mb command. Due to a requirement in the Cromwell engine, do not use an underscore (_) character in the bucket name or you will encounter an error.
gsutil mb gs://BUCKET
The pipeline will output results, logs, and intermediate files to this bucket.
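The underscore restriction can be checked before the bucket is created. The following is a minimal sketch, not part of gsutil or Cromwell: bucket_name_ok is a hypothetical helper that only tests for the underscore described above and does not validate the other Cloud Storage naming rules.

```shell
# Hypothetical helper (an assumption for illustration, not a gsutil feature).
# Returns 0 if the name is non-empty and contains no underscore, 1 otherwise.
bucket_name_ok() {
  case "$1" in
    "" | *_*) return 1 ;;
    *) return 0 ;;
  esac
}

# Example use, with BUCKET standing in for your bucket name:
#   bucket_name_ok "BUCKET" && gsutil mb "gs://BUCKET"
```

Running the check first avoids creating a bucket that the pipeline would later reject.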
Download the example files
Download the WDL and helper script:
git clone https://github.com/broadinstitute/wdl-runner.git
git clone https://github.com/gatk-workflows/broad-prod-wgs-germline-snps-indels.git
The gatk-workflows/broad-prod-wgs-germline-snps-indels repository contains the following files needed to run the pipeline:
- *.wdl: Workflow definition
- *.inputs.json: Input parameters, including paths to the BAM files and reference genome
- *.options.json: Workflow runtime options
You can find the Cromwell pipeline definition file used to run WDL pipelines in the wdl_runner/ directory of the broadinstitute/wdl-runner repository.
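After cloning, it can help to confirm that the workflow files are where the later run command expects them. The sketch below assumes the three file names that appear in the run command in this tutorial; check_workflow_files is a hypothetical helper, not something shipped in either repository.

```shell
# Hypothetical helper: verify that a directory contains the WDL, inputs, and
# options files referenced by the run command in this tutorial.
# Prints each missing file to stderr and returns 1 if any is absent.
check_workflow_files() {
  dir="$1"
  missing=0
  for f in PairedEndSingleSampleWf.wdl \
           PairedEndSingleSampleWf.hg38.inputs.json \
           PairedEndSingleSampleWf.options.json; do
    if [ ! -f "$dir/$f" ]; then
      echo "missing: $dir/$f" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example use from the directory where you ran git clone:
#   check_workflow_files "${PWD}/broad-prod-wgs-germline-snps-indels"
```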
Run the pipeline using sample data
The pipeline runs with WGS data using build 38 of the human reference genome. The input files are unaligned BAM files.
To run the pipeline:
Create the environment variable GATK_GOOGLE_DIR, which points to the folder containing the Broad pipeline files:
export GATK_GOOGLE_DIR="${PWD}"/broad-prod-wgs-germline-snps-indels
Create the environment variable GATK_OUTPUT_DIR, which points to the Cloud Storage bucket and a folder for the output of the workflow, intermediate work files, and logging:
export GATK_OUTPUT_DIR=gs://BUCKET/FOLDER
Change directory to the /wdl_runner folder in the repository you downloaded. This directory contains the pipeline definition file for running WDL-based pipelines on Google Cloud:
cd wdl-runner/wdl_runner/
Run the pipeline:
gcloud beta lifesciences pipelines run \
    --pipeline-file wdl_pipeline.yaml \
    --location us-central1 \
    --regions us-central1 \
    --inputs-from-file WDL=${GATK_GOOGLE_DIR}/PairedEndSingleSampleWf.wdl,\
WORKFLOW_INPUTS=${GATK_GOOGLE_DIR}/PairedEndSingleSampleWf.hg38.inputs.json,\
WORKFLOW_OPTIONS=${GATK_GOOGLE_DIR}/PairedEndSingleSampleWf.options.json \
    --env-vars WORKSPACE=${GATK_OUTPUT_DIR}/work,\
OUTPUTS=${GATK_OUTPUT_DIR}/output \
    --logging ${GATK_OUTPUT_DIR}/logging/
The command returns an operation ID in the format Running [operations/OPERATION_ID]. You can use the operation ID to track the status of the pipeline by running the following command (make sure that the value of the --location flag matches the location specified in the previous step):
gcloud beta lifesciences operations describe OPERATION_ID \
    --location=us-central1 \
    --format='yaml(done, error, metadata.events)'
The operations describe command returns done: true when the pipeline finishes.
You can run a script included with the wdl_runner to check every 300 seconds whether the job is running, has finished, or returned an error:
../monitoring_tools/monitor_wdl_pipeline.sh OPERATION_ID us-central1 300
After the pipeline finishes, run the following command to list the outputs in your Cloud Storage bucket:
gsutil ls gs://BUCKET/FOLDER/output/
You can either view the intermediate files created by the pipeline and choose which ones you want to keep, or remove them to reduce costs associated with Cloud Storage. To remove the files, see Deleting intermediate files in your Cloud Storage bucket.
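If you script the steps above, two small helpers can be useful: one to pull the operation ID out of the Running [operations/OPERATION_ID] line, and one to classify the output of the operations describe command. The following is a sketch under stated assumptions, not the behavior of monitor_wdl_pipeline.sh. Both function names are hypothetical, and the classifier assumes that a failed operation shows a top-level error: key in the YAML output alongside done: true.

```shell
# Hypothetical helper: extract the ID from a line such as
# "Running [operations/OPERATION_ID]". Takes the text between the last
# "operations/" and the following "]".
extract_operation_id() {
  printf '%s\n' "$1" | sed -n 's|.*operations/\([^]]*\)].*|\1|p'
}

# Hypothetical helper: read the YAML printed by "operations describe" on
# stdin and print one of: running, succeeded, failed.
# Assumption: failures appear as a top-level "error:" key next to "done: true".
operation_state() {
  status="$(cat)"
  if printf '%s\n' "$status" | grep -q '^done: true'; then
    if printf '%s\n' "$status" | grep -q '^error:'; then
      echo failed
    else
      echo succeeded
    fi
  else
    echo running
  fi
}

# Example use:
#   gcloud beta lifesciences operations describe OPERATION_ID \
#       --location=us-central1 \
#       --format='yaml(done, error, metadata.events)' | operation_state
```

Keeping the classifier separate from the gcloud call makes it easy to try against saved output before wiring it into a longer script.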
Run the GATK Best Practices pipeline on your data
Before you run the pipeline on your local data, you need to copy the data into a Cloud Storage bucket.
Copy input files
The pipeline can run with unaligned BAM files stored in Cloud Storage. If your files are in a different format, such as aligned BAM or FASTQ, you must convert them to unaligned BAM before the pipeline can use them. You can convert them locally before uploading, or you can use the Pipelines API to convert them in the cloud.
The following example shows how to copy a single file from a local filesystem to a Cloud Storage bucket:
gsutil -m -o 'GSUtil:parallel_composite_upload_threshold=150M' cp FILE \
    gs://BUCKET/FOLDER
For more examples of how to copy files to a Cloud Storage bucket, see the section on Copying data into Cloud Storage.
The gsutil command-line tool verifies checksums automatically, so a successful transfer means that your data arrived intact and is ready for use with the GATK Best Practices.
Run the pipeline on your data
To run the GATK Best Practices on your own unaligned BAM files, make a copy of PairedEndSingleSampleWf.hg38.inputs.json, then update the paths to point to your files in a Cloud Storage bucket. You can then follow the steps in Run the pipeline using sample data, using the updated PairedEndSingleSampleWf.hg38.inputs.json file.
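One way to update the paths is to rewrite a Cloud Storage prefix while copying the file. The sketch below is illustrative only: rewrite_gcs_prefix is a hypothetical helper, the prefixes shown in the usage comment are placeholders, and a blanket substitution also changes reference genome paths that share the prefix, so review the result before running the pipeline.

```shell
# Hypothetical helper: copy an inputs JSON file, replacing every occurrence of
# one Cloud Storage prefix with another.
# Arguments: SOURCE_JSON DEST_JSON OLD_PREFIX NEW_PREFIX
# Assumptions: neither prefix contains "|", "&", or a backslash, and any "."
# in OLD_PREFIX is treated by sed as "any character".
rewrite_gcs_prefix() {
  src="$1"
  dest="$2"
  old="$3"
  new="$4"
  # "|" is the sed delimiter because gs:// paths contain "/".
  sed "s|${old}|${new}|g" "$src" > "$dest"
}

# Example use (OLD_PREFIX and the output file name are placeholders):
#   rewrite_gcs_prefix PairedEndSingleSampleWf.hg38.inputs.json \
#       my-sample.inputs.json "OLD_PREFIX" "gs://BUCKET/FOLDER"
```

Writing to a new file keeps the original inputs file available for the sample-data run.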
If your data isn't made up of unaligned BAM files, or if it involves a different reference genome, exome sequencing, targeted panels, or somatic data, you will have to use different workflows. See the GATK Support Forum and the Broad Institute GitHub repository for more information.
Troubleshooting
The pipeline is configured to use Compute Engine instances in specific regions and zones. When you run the gcloud tool, it automatically uses a default region and zone based on the location where your Google Cloud project was created. This can result in the following error message when running the pipeline:
"ERROR: (gcloud.beta.lifesciences.pipelines.run) INVALID_ARGUMENT: Error: validating pipeline: zones and regions cannot be specified together"
To solve this issue, remove the default region and zone by running the following commands, and then run the pipeline again:
gcloud config unset compute/zone
gcloud config unset compute/region
For additional information on setting the default region and zone for your Google Cloud project, see Changing the default zone or region.
If you encounter problems when running the pipeline, see Cloud Life Sciences API troubleshooting.
GATK has strict expectations about input file formats. To avoid problems, you can validate that your files pass ValidateSamFile.
If your GATK run fails, you can check the logs by running the following command:
gsutil ls gs://BUCKET/FOLDER/logging
If you encounter permission errors, check that your service account has read access to the input files and write access to the output bucket path. If you're writing output files to a bucket in a Google Cloud project that isn't your own, you'll need to grant the service account permission to access the bucket.
Cleaning up
Deleting intermediate files in your Cloud Storage bucket
When you run the pipeline, it stores intermediate files in gs://BUCKET/FOLDER/work. You can remove the files after the workflow completes to reduce Cloud Storage charges.
To view the amount of space used in the work directory, run the following command. The command might take several minutes to run due to the size of the files in the directory.
gsutil du -sh gs://BUCKET/FOLDER/work
To remove all of the intermediate files in the work directory, run the following command:
gsutil -m rm gs://BUCKET/FOLDER/work/**
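Because the remove command is recursive, a guard against pointing it at the wrong folder may be worthwhile when it runs from a script. This is a minimal sketch: is_work_path is a hypothetical helper that only checks the shape of the path and does not contact Cloud Storage.

```shell
# Hypothetical helper: return 0 only for gs:// paths whose last component is
# "work" (with or without a trailing slash), 1 for anything else.
is_work_path() {
  case "$1" in
    gs://*/work | gs://*/work/) return 0 ;;
    *) return 1 ;;
  esac
}

# Example use:
#   dir="gs://BUCKET/FOLDER/work"
#   is_work_path "$dir" && gsutil -m rm "$dir/**"
```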
Deleting the project
The easiest way to eliminate billing is to delete the project you used for the tutorial.
To delete the project:
- In the Cloud Console, go to the Projects page.
- In the project list, select the project you want to delete and click Delete project.
- In the dialog, type the project ID, and then click Shut down to delete the project.
What's next
- This tutorial shows how to run a predefined workflow in a limited use case, but is not meant to be run in production. For information on how to perform genomic data processing in a production environment on Google Cloud, see Genomic data processing reference architecture.
- The Broad Institute GATK site and forums provide more complete background information, documentation, and support for the GATK tools and WDL.