Run GATK Best Practices

This page explains how to run a secondary genomic analysis pipeline on Google Cloud using the Genome Analysis Toolkit (GATK) Best Practices. The GATK Best Practices are provided by the Broad Institute.

The workflow used in this tutorial is an implementation of the GATK Best Practices for variant discovery in whole genome sequencing (WGS) data. The workflow is written in the Broad Institute's Workflow Definition Language (WDL) and runs on the Cromwell WDL runner.

Objectives

After completing this tutorial, you'll know how to:

  • Run a pipeline using the GATK Best Practices with data from build 38 of the human reference genome
  • Run a pipeline using the GATK Best Practices using your own data

Costs

This tutorial uses the following billable components of Google Cloud:

  • Compute Engine
  • Cloud Storage

To generate a cost estimate based on your projected usage, use the pricing calculator. New Google Cloud users might be eligible for a free trial.

Before you begin

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud Console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  3. Make sure that billing is enabled for your Cloud project. Learn how to confirm that billing is enabled for your project.

  4. Enable the Cloud Life Sciences, Compute Engine, and Cloud Storage APIs.

    Enable the APIs

  5. Install and initialize the Cloud SDK.
  6. Update and install gcloud components:
    gcloud components update &&
    gcloud components install beta
  7. Install git to download the required files.

    Download git

  8. By default, Compute Engine has resource quotas in place to prevent inadvertent usage. By increasing quotas, you can launch more virtual machines concurrently, which increases throughput and reduces turnaround time.

    For best results in this tutorial, request additional quota above your project's default. The following list shows the recommended quota for each resource, with the minimum needed to run the tutorial in parentheses. Make your quota requests in the us-central1 region:

    • CPUs: 101 (minimum 17)
    • Persistent Disk Standard (GB): 10,500 (minimum 320)
    • In-use IP Addresses: 51 (minimum 2)

    You can leave other quota request fields empty to keep your current quotas.

Create a Cloud Storage bucket

Create a Cloud Storage bucket using the gsutil mb command. Due to a requirement in the Cromwell engine, do not use an underscore (_) character in the bucket name or you will encounter an error.

gsutil mb gs://BUCKET

The pipeline outputs results, logs, and intermediate files to this bucket.
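
Because an underscore in the bucket name leads to an error, it can be worth checking the name before you create the bucket. The following is a minimal sketch, not part of the Broad or Google tooling: the function name bucket_name_ok is hypothetical, and it checks only the underscore rule described above, not the full Cloud Storage naming rules.

```shell
# Hypothetical helper: succeeds only if the bucket name has no underscore.
# It checks just the Cromwell-related rule from this tutorial, nothing else.
bucket_name_ok() {
  case "$1" in
    *_*) echo "bucket name '$1' contains an underscore" >&2; return 1 ;;
    *)   return 0 ;;
  esac
}
```

For example, running bucket_name_ok BUCKET && gsutil mb gs://BUCKET creates the bucket only when the check passes.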

Download the example files

Run the following commands to download the WDL and helper script:

git clone https://github.com/broadinstitute/wdl-runner.git
git clone https://github.com/gatk-workflows/broad-prod-wgs-germline-snps-indels.git

The gatk-workflows/broad-prod-wgs-germline-snps-indels repository contains the following files needed to run the pipeline:

  • *.wdl: Workflow definition
  • *.inputs.json: Input parameters, including paths to the BAM files and reference genome
  • *.options.json: Workflow runtime options

You can find the Cromwell pipeline definition file used to run WDL pipelines in the wdl_runner/ directory of the broadinstitute/wdl-runner repository.
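
Before you run the pipeline, you might want to confirm that the clone contains the three workflow files that the run command refers to. The sketch below assumes the file names used later in this tutorial (PairedEndSingleSampleWf.wdl and its .hg38.inputs.json and .options.json companions); the function name check_workflow_files is hypothetical.

```shell
# Hypothetical helper: verifies that the three PairedEndSingleSampleWf
# files used in this tutorial exist in the given directory.
check_workflow_files() {
  dir="$1"
  missing=0
  for f in \
      PairedEndSingleSampleWf.wdl \
      PairedEndSingleSampleWf.hg38.inputs.json \
      PairedEndSingleSampleWf.options.json; do
    if [ ! -f "${dir}/${f}" ]; then
      # Report each missing file on stderr and remember the failure.
      echo "missing: ${dir}/${f}" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

Run it from the directory where you cloned the repositories, for example check_workflow_files broad-prod-wgs-germline-snps-indels. It returns a nonzero status if any file is missing.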

Run the pipeline using sample data

This section shows how to run the pipeline with WGS data using build 38 of the human reference genome. The input files are unaligned BAM files.

To run the pipeline, complete the following steps:

  1. Create the environment variable GATK_GOOGLE_DIR which points to the folder containing the Broad pipeline files:

    export GATK_GOOGLE_DIR="${PWD}"/broad-prod-wgs-germline-snps-indels
    
  2. Create the environment variable GATK_OUTPUT_DIR which points to the Cloud Storage bucket and a folder for the output of the workflow, intermediate work files, and logging:

    export GATK_OUTPUT_DIR=gs://BUCKET/FOLDER
    
  3. Change to the wdl_runner folder in the wdl-runner repository you downloaded. This directory contains the pipeline definition file for running WDL-based pipelines on Google Cloud:

    cd wdl-runner/wdl_runner/
    
  4. Run the pipeline:

    Choose one of the following options, depending on whether you're using a default VPC or a custom VPC:

    Default VPC

    gcloud beta lifesciences pipelines run \
    --pipeline-file wdl_pipeline.yaml \
    --location us-central1 \
    --regions us-central1 \
    --inputs-from-file WDL=${GATK_GOOGLE_DIR}/PairedEndSingleSampleWf.wdl,\
    WORKFLOW_INPUTS=${GATK_GOOGLE_DIR}/PairedEndSingleSampleWf.hg38.inputs.json,\
    WORKFLOW_OPTIONS=${GATK_GOOGLE_DIR}/PairedEndSingleSampleWf.options.json \
    --env-vars WORKSPACE=${GATK_OUTPUT_DIR}/work,\
    OUTPUTS=${GATK_OUTPUT_DIR}/output \
    --logging ${GATK_OUTPUT_DIR}/logging/
    

    Custom VPC

    1. Create the environment variables NETWORK and SUBNETWORK to specify the name of your VPC network and subnetwork:

      export NETWORK=VPC_NETWORK
      export SUBNETWORK=VPC_SUBNET
      
    2. Edit the PairedEndSingleSampleWf.options.json file located in the broad-prod-wgs-germline-snps-indels directory and modify the zones to include only zones within the region of your subnet. For example, if you are using a us-central1 subnet, the zones field would look like this: "zones": "us-central1-a us-central1-b us-central1-c us-central1-f".

    3. Run the pipeline:

      gcloud beta lifesciences pipelines run \
      --pipeline-file wdl_pipeline.yaml \
      --location us-central1 \
      --regions us-central1 \
      --network ${NETWORK} \
      --subnetwork ${SUBNETWORK} \
      --inputs-from-file WDL=${GATK_GOOGLE_DIR}/PairedEndSingleSampleWf.wdl,\
      WORKFLOW_INPUTS=${GATK_GOOGLE_DIR}/PairedEndSingleSampleWf.hg38.inputs.json,\
      WORKFLOW_OPTIONS=${GATK_GOOGLE_DIR}/PairedEndSingleSampleWf.options.json \
      --env-vars WORKSPACE=${GATK_OUTPUT_DIR}/work,\
      OUTPUTS=${GATK_OUTPUT_DIR}/output,\
      NETWORK=${NETWORK},\
      SUBNETWORK=${SUBNETWORK} \
      --logging ${GATK_OUTPUT_DIR}/logging/
      
  5. The command returns an operation ID in the format Running [operations/OPERATION_ID]. To track the status of the pipeline, run the gcloud beta lifesciences operations describe command. Make sure that the value of the --location flag matches the location specified in the previous step:

    gcloud beta lifesciences operations describe OPERATION_ID \
        --location=us-central1 \
        --format='yaml(done, error, metadata.events)'
    
  6. The operations describe command returns done: true when the pipeline finishes.

    You can run a script included with the wdl_runner to check every 300 seconds whether the job is running, has finished, or returned an error:

    ../monitoring_tools/monitor_wdl_pipeline.sh OPERATION_ID us-central1 300
    
  7. After the pipeline finishes, run the following command to list the outputs in your Cloud Storage bucket:

    gsutil ls gs://BUCKET/FOLDER/output/
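
If you prefer not to use the monitoring script, the status check from steps 5 and 6 can be sketched as a small shell helper. This is an illustration under assumptions, not the method used by monitor_wdl_pipeline.sh: the function name operation_is_done is hypothetical, and it relies only on the done: true line that the operations describe command prints in YAML format when the pipeline finishes.

```shell
# Hypothetical helper: reads the YAML printed by `operations describe`
# on stdin and succeeds only if it contains a top-level "done: true" line.
operation_is_done() {
  grep -Eq '^done:[[:space:]]*true[[:space:]]*$'
}

# Example loop (assumes OPERATION_ID is set and gcloud is installed):
#   until gcloud beta lifesciences operations describe "${OPERATION_ID}" \
#       --location=us-central1 --format='yaml(done)' | operation_is_done; do
#     sleep 300
#   done
```

The helper reports only whether the operation has finished. To find out whether it finished with an error, inspect the error field in the full operations describe output.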
    

You can either view the intermediate files created by the pipeline and choose which ones you want to keep, or remove them to reduce costs associated with Cloud Storage. To remove the files, see Deleting intermediate files in your Cloud Storage bucket.

Run the GATK Best Practices pipeline on your data

Before you run the pipeline on your local data, you need to copy the data into a Cloud Storage bucket.

Copy input files

The pipeline can run with unaligned BAM files stored in Cloud Storage. If your files are in a different format, such as aligned BAM or FASTQ, you must convert them before they can be uploaded to Cloud Storage. You can convert them locally, or you can use the Pipelines API to convert them in the cloud.

The following example shows how to copy a single file from a local file system to a Cloud Storage bucket:

gsutil -m -o 'GSUtil:parallel_composite_upload_threshold=150M' cp FILE \
    gs://BUCKET/FOLDER

For more examples of how to copy files to a Cloud Storage bucket, see the section on Copying data into Cloud Storage.

The gsutil command-line tool verifies checksums automatically, so when the transfer succeeds, you know that your data was copied intact and is ready for use with the GATK Best Practices.

Run the pipeline on your data

To run the GATK Best Practices on your own unaligned BAM files, make a copy of PairedEndSingleSampleWf.hg38.inputs.json, then update the paths in the copy to point to your files in a Cloud Storage bucket. You can then follow the steps in Run the pipeline using sample data, passing your updated copy as the WORKFLOW_INPUTS file.
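
To see which Cloud Storage paths an inputs file refers to before you edit it, you can list them. The following is a rough sketch: the function name list_gcs_paths is hypothetical, and it finds paths by matching the text gs:// up to the next double quote rather than by parsing the JSON.

```shell
# Hypothetical helper: prints each distinct gs:// path found in a file.
# It matches text up to the next double quote; it is not a JSON parser.
list_gcs_paths() {
  grep -o 'gs://[^"]*' "$1" | sort -u
}
```

For example, list_gcs_paths PairedEndSingleSampleWf.hg38.inputs.json prints the paths in the original file. Running it again on your copy shows whether any paths still point to the sample data.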

If your data isn't made up of unaligned BAM files from whole genome sequencing, for example if it uses a different reference genome or comes from exome sequencing, targeted panels, or somatic samples, you must use different workflows. See the GATK Support Forum and the Broad Institute GitHub repository for more information.

Troubleshooting

  • The pipeline is configured to use Compute Engine instances in specific regions and zones. When you run the gcloud tool, it automatically uses a default region and zone based on the location where your Google Cloud project was created. This can result in the following error message when running the pipeline:

    "ERROR: (gcloud.beta.lifesciences.pipelines.run) INVALID_ARGUMENT: Error: validating pipeline: zones and regions cannot be specified together"

    To solve this issue, remove the default region and zone by running the following commands, and then run the pipeline again:

    gcloud config unset compute/zone
    gcloud config unset compute/region
    

    For additional information on setting the default region and zone for your Google Cloud project, see Changing the default zone or region.

  • If you encounter problems when running the pipeline, see Cloud Life Sciences API troubleshooting.

  • GATK has strict expectations about input file formats. To avoid problems, you can validate that your files pass ValidateSamFile.

  • If your GATK run fails, you can find the log files by listing the logging folder:

    gsutil ls gs://BUCKET/FOLDER/logging
    
  • If you encounter permission errors, check that your service account has read access to the input files and write access to the output bucket path. If you're writing output files to a bucket in a Google Cloud project that isn't your own, you need to grant the service account permission to access the bucket.

Cleaning up

Deleting intermediate files in your Cloud Storage bucket

When you run the pipeline, it stores intermediate files in gs://BUCKET/FOLDER/work. You can remove the files after the workflow completes to reduce Cloud Storage charges.

To view the amount of space used in the work directory, run the following command. The command might take several minutes to run due to the size of the files in the directory.

gsutil du -sh gs://BUCKET/FOLDER/work

To remove the intermediate files in the work directory, run the following command:

gsutil -m rm gs://BUCKET/FOLDER/work/**

Deleting the project

The easiest way to eliminate billing is to delete the project you used for the tutorial.

To delete the project:

  1. In the Cloud Console, go to the Projects page.

    Go to the Projects page

  2. In the project list, select the checkbox next to the project that you want to delete, and then click Delete project.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.

What's next

  • This tutorial shows how to run a predefined workflow in a limited use case, but is not meant to be run in production. For information on how to perform genomic data processing in a production environment on Google Cloud, see Genomic data processing reference architecture.
  • The Broad Institute GATK site and forums provide more background information, documentation, and support for the GATK tools and WDL.