This page explains how to run a secondary genomic analysis pipeline on Google Cloud using the Genome Analysis Toolkit (GATK) Best Practices. The GATK Best Practices are provided by the Broad Institute.
The workflow used in this tutorial is an implementation of the GATK Best Practices for variant discovery in whole genome sequencing (WGS) data. The workflow is written in the Broad Institute's Workflow Definition Language (WDL) and runs on the Cromwell WDL runner.
Objectives
After completing this tutorial, you'll know how to:
- Run a pipeline using the GATK Best Practices with data from build 38 of the human reference genome
- Run a pipeline using the GATK Best Practices using your own data
Costs
In this document, you use the following billable components of Google Cloud:
- Compute Engine
- Cloud Storage
To generate a cost estimate based on your projected usage, use the pricing calculator.
Before you begin
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
- In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
- Make sure that billing is enabled for your Google Cloud project.
- Enable the Cloud Life Sciences, Compute Engine, and Cloud Storage APIs.
- Install the Google Cloud CLI.
- To initialize the gcloud CLI, run the following command:
gcloud init
- Update and install gcloud components:
gcloud components update
gcloud components install beta
- Install git to download the required files.
- By default, Compute Engine has resource quotas in place to prevent inadvertent usage. By increasing quotas, you can launch more virtual machines concurrently, which increases throughput and reduces turnaround time.
For best results in this tutorial, request additional quota above your project's default. The following list gives the recommended quota increases, along with the minimum quotas needed to run the tutorial. Make your quota requests in the us-central1 region:
- CPUs: 101 (minimum 17)
- Persistent Disk Standard (GB): 10,500 (minimum 320)
- In-use IP Addresses: 51 (minimum 2)
You can leave other quota request fields empty to keep your current quotas. A command-line sketch of these setup steps follows this list.
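The setup steps above can also be done from the command line. The following is a minimal sketch, assuming the gcloud CLI is already initialized for your project; the service names are assumed to be the current API identifiers for the services listed above, and the regions describe command reports current quota limits and usage in us-central1:
gcloud services enable lifesciences.googleapis.com compute.googleapis.com storage.googleapis.com
gcloud compute regions describe us-central1 --format='yaml(quotas)'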
Create a Cloud Storage bucket
Create a Cloud Storage bucket using the gcloud storage buckets create command. Due to a requirement in the Cromwell engine, do not use an underscore (_) character in the bucket name or you will encounter an error.
gcloud storage buckets create gs://BUCKET
The pipeline outputs results, logs, and intermediate files to this bucket.
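Optionally, you can create the bucket in the region where the pipeline runs; keeping data and Compute Engine instances in the same region can reduce latency and network costs. A minimal sketch using the --location flag, with us-central1 matching the region used later in this tutorial:
gcloud storage buckets create gs://BUCKET --location=us-central1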
Download the example files
Run the following commands to download the WDL and helper script:
git clone https://github.com/broadinstitute/wdl-runner.git
git clone https://github.com/gatk-workflows/broad-prod-wgs-germline-snps-indels.git
The gatk-workflows/broad-prod-wgs-germline-snps-indels repository contains the following files needed to run the pipeline:
- *.wdl: Workflow definition
- *.inputs.json: Input parameters, including paths to the BAM files and reference genome
- *.options.json: Workflow runtime options
You can find the Cromwell pipeline definition file used to run WDL pipelines in the wdl_runner/ directory of the broadinstitute/wdl-runner repository.
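To confirm that the clones succeeded, you can list the files this tutorial references; this is an optional sanity check, and the file names match those used in the commands later on this page:
ls broad-prod-wgs-germline-snps-indels/PairedEndSingleSampleWf.*
ls wdl-runner/wdl_runner/wdl_pipeline.yaml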
Run the pipeline using sample data
This section shows how to run the pipeline with WGS data using build 38 of the human reference genome. The input files are unaligned BAM files.
To run the pipeline, complete the following steps:
Create the environment variable GATK_GOOGLE_DIR, which points to the folder containing the Broad pipeline files:
export GATK_GOOGLE_DIR="${PWD}"/broad-prod-wgs-germline-snps-indels
Create the environment variable GATK_OUTPUT_DIR, which points to the Cloud Storage bucket and a folder for the output of the workflow, intermediate work files, and logging:
export GATK_OUTPUT_DIR=gs://BUCKET/FOLDER
Change directory to the wdl_runner/ folder in the wdl-runner repository you downloaded. This directory contains the pipeline definition file for running WDL-based pipelines on Google Cloud:
cd wdl-runner/wdl_runner/
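Optionally, before launching the pipeline, you can confirm that the environment variables are set and that the WDL file is where the next command expects it. This is a hedged sanity check, not a required step:
echo "${GATK_GOOGLE_DIR}" "${GATK_OUTPUT_DIR}"
ls "${GATK_GOOGLE_DIR}"/PairedEndSingleSampleWf.wdl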
Run the pipeline:
Choose one of the following options depending on whether you're using a default VPC or a custom VPC:
Default VPC
gcloud beta lifesciences pipelines run \
--pipeline-file wdl_pipeline.yaml \
--location us-central1 \
--regions us-central1 \
--inputs-from-file WDL=${GATK_GOOGLE_DIR}/PairedEndSingleSampleWf.wdl,\
WORKFLOW_INPUTS=${GATK_GOOGLE_DIR}/PairedEndSingleSampleWf.hg38.inputs.json,\
WORKFLOW_OPTIONS=${GATK_GOOGLE_DIR}/PairedEndSingleSampleWf.options.json \
--env-vars WORKSPACE=${GATK_OUTPUT_DIR}/work,\
OUTPUTS=${GATK_OUTPUT_DIR}/output \
--logging ${GATK_OUTPUT_DIR}/logging/
Custom VPC
Create the environment variables NETWORK and SUBNETWORK to specify the name of your VPC network and subnetwork:
export NETWORK=VPC_NETWORK
export SUBNETWORK=VPC_SUBNET
Edit the PairedEndSingleSampleWf.options.json file located in the broad-prod-wgs-germline-snps-indels directory and modify the zones to include only zones within the region of your subnet. For example, if you are using a us-central1 subnet, the zones field would look like this: "zones": "us-central1-a us-central1-b us-central1-c us-central1-f".
Run the pipeline:
gcloud beta lifesciences pipelines run \
--pipeline-file wdl_pipeline.yaml \
--location us-central1 \
--regions us-central1 \
--network ${NETWORK} \
--subnetwork ${SUBNETWORK} \
--inputs-from-file WDL=${GATK_GOOGLE_DIR}/PairedEndSingleSampleWf.wdl,\
WORKFLOW_INPUTS=${GATK_GOOGLE_DIR}/PairedEndSingleSampleWf.hg38.inputs.json,\
WORKFLOW_OPTIONS=${GATK_GOOGLE_DIR}/PairedEndSingleSampleWf.options.json \
--env-vars WORKSPACE=${GATK_OUTPUT_DIR}/work,\
OUTPUTS=${GATK_OUTPUT_DIR}/output,\
NETWORK=${NETWORK},\
SUBNETWORK=${SUBNETWORK} \
--logging ${GATK_OUTPUT_DIR}/logging/
The command returns an operation ID in the format Running [operations/OPERATION_ID]. You can use the gcloud beta lifesciences operations describe command to track the status of the pipeline by running the following command (make sure that the value of the --location flag matches the location specified in the previous step):
gcloud beta lifesciences operations describe OPERATION_ID \
--location=us-central1 \
--format='yaml(done, error, metadata.events)'
The operations describe command returns done: true when the pipeline finishes.
You can run a script included with the wdl_runner to check every 300 seconds whether the job is running, has finished, or returned an error:
../monitoring_tools/monitor_wdl_pipeline.sh OPERATION_ID us-central1 300
After the pipeline finishes, run the following command to list the outputs in your Cloud Storage bucket:
gcloud storage ls gs://BUCKET/FOLDER/output/
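If you want to download one of the listed files to your local machine, you can copy it with gcloud storage cp; FILENAME below is a placeholder for a name returned by the previous command:
gcloud storage cp gs://BUCKET/FOLDER/output/FILENAME .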
You can either view the intermediate files created by the pipeline and choose which ones you want to keep, or remove them to reduce costs associated with Cloud Storage. To remove the files, see Deleting intermediate files in your Cloud Storage bucket.
Run the GATK Best Practices pipeline on your data
Before you run the pipeline on your local data, you need to copy the data into a Cloud Storage bucket.
Copy input files
The pipeline can run with unaligned BAM files stored in Cloud Storage. If your files are in a different format, such as aligned BAM or FASTQ, you must convert them before they can be uploaded to Cloud Storage. You can convert them locally, or you can use the Pipelines API to convert them in the cloud.
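As an illustration of converting files locally, the Picard FastqToSam and RevertSam tools can produce unaligned BAMs. The following is a sketch that assumes a local Picard installation (picard.jar) and uses placeholder file and sample names; adjust it to your own data and read-group metadata:
# Convert paired-end FASTQ files to an unaligned BAM.
java -jar picard.jar FastqToSam FASTQ=sample_R1.fastq.gz FASTQ2=sample_R2.fastq.gz OUTPUT=sample.unmapped.bam SAMPLE_NAME=sample
# Strip alignments from an existing aligned BAM.
java -jar picard.jar RevertSam I=sample.aligned.bam O=sample.unmapped.bam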
The following example shows how to copy a single file from a local file system to a Cloud Storage bucket:
gcloud storage cp FILE gs://BUCKET/FOLDER
For more examples of how to copy files to a Cloud Storage bucket, see the section on Copying data into Cloud Storage.
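For instance, to copy several unaligned BAM files at once, you can use a wildcard; the local file pattern and destination folder below are placeholders:
gcloud storage cp *.unmapped.bam gs://BUCKET/FOLDER/input/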
The gcloud CLI verifies checksums automatically, so when the transfer succeeds, your files have been copied intact and are ready for use with the GATK Best Practices.
Run the pipeline on your data
To run the GATK Best Practices on your own unaligned BAM files, make a copy of PairedEndSingleSampleWf.hg38.inputs.json, then update the paths to point to your files in a Cloud Storage bucket. You can then follow the steps in Run the pipeline using sample data, using the updated PairedEndSingleSampleWf.hg38.inputs.json file.
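For reference, the updated entries in your copy of the inputs file point at your own gs:// paths rather than the sample data. The snippet below is only illustrative: the key names are placeholders, so keep the keys that already exist in the file and change only the values:
{
  "PairedEndSingleSampleWorkflow.sample_name": "SAMPLE_NAME",
  "PairedEndSingleSampleWorkflow.flowcell_unmapped_bams": [
    "gs://BUCKET/FOLDER/input/SAMPLE.unmapped.bam"
  ]
}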
If your data isn't made up of unaligned BAM files and instead involves reference genomes, exome sequencing, targeted panels, or somatic data, you must use different workflows. See the GATK Support Forum and the Broad Institute GitHub repository for more information.
Troubleshooting
The pipeline is configured to use Compute Engine instances in specific regions and zones. When you run the gcloud CLI, it automatically uses a default region and zone based on the location where your Google Cloud project was created. This can result in the following error message when running the pipeline:
"ERROR: (gcloud.beta.lifesciences.pipelines.run) INVALID_ARGUMENT: Error: validating pipeline: zones and regions cannot be specified together"
To solve this issue, remove the default region and zone by running the following commands, and then run the pipeline again:
gcloud config unset compute/zone
gcloud config unset compute/region
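To check whether a default region or zone is currently configured before removing it, you can query the gcloud configuration; each command prints the configured value or reports that the property is unset:
gcloud config get-value compute/zone
gcloud config get-value compute/region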
For additional information on setting the default region and zone for your Google Cloud project, see Changing the default zone or region.
If you encounter problems when running the pipeline, see Cloud Life Sciences API troubleshooting.
GATK has strict expectations about input file formats. To avoid problems, verify that your files pass ValidateSamFile before running the pipeline.
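For example, you can run ValidateSamFile in summary mode before uploading your data. This sketch assumes a local Picard installation (picard.jar) and uses a placeholder file name:
java -jar picard.jar ValidateSamFile I=sample.unmapped.bam MODE=SUMMARY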
If your GATK run fails, you can check the logs by running the following command:
gcloud storage ls gs://BUCKET/FOLDER/logging
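To inspect an individual log file listed by the previous command, you can print it with gcloud storage cat; LOG_FILE below is a placeholder for one of the listed names:
gcloud storage cat gs://BUCKET/FOLDER/logging/LOG_FILE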
If you encounter permission errors, check that your service account has read access to the input files and write access to the output bucket path. If you're writing output files to a bucket in a Google Cloud project that isn't your own, you need to grant the service account permission to access the bucket.
Cleaning up
Deleting intermediate files in your Cloud Storage bucket
When you run the pipeline, it stores intermediate files in gs://BUCKET/FOLDER/work. You can remove the files after the workflow completes to reduce Cloud Storage charges.
To view the amount of space used in the work directory, run the following command. The command might take several minutes to run due to the size of the files in the directory.
gcloud storage du gs://BUCKET/FOLDER/work --readable-sizes --summarize
To remove the intermediate files in the work directory, run the following command:
gcloud storage rm gs://BUCKET/FOLDER/work/**
Deleting the project
The easiest way to eliminate billing is to delete the project you used for the tutorial.
To delete the project:
- In the Google Cloud console, go to the Projects page.
- In the project list, select the project you want to delete and click Delete project.
- In the dialog, type the project ID, and then click Shut down to delete the project.
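Alternatively, you can delete the project from the command line; PROJECT_ID is the ID of the project you used for this tutorial:
gcloud projects delete PROJECT_ID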
What's next
- The Broad Institute GATK site and forums provide more background information, documentation, and support for the GATK tools and WDL.