Running GATK Best Practices

This page explains how to run a pipeline on Google Cloud using the GATK Best Practices provided by the Broad Institute.

The workflow used in this tutorial is an implementation of the GATK Best Practices for variant discovery in whole genome sequencing (WGS) data. The workflow is written in the Broad Institute's Workflow Definition Language (WDL) and runs on the Cromwell WDL runner.

Objectives

After completing this tutorial, you'll know how to:

  • Run a pipeline using the GATK Best Practices with data from build 38 of the human reference genome
  • Run a pipeline using the GATK Best Practices with your own data

Costs

This tutorial uses billable components of Google Cloud, including:

  • Compute Engine
  • Cloud Storage

Use the Pricing Calculator to generate a cost estimate based on your projected usage. New Cloud Platform users might be eligible for a free trial.

Before you begin

  1. Sign in to your Google Account.

    If you don't already have one, sign up for a new account.

  2. Select or create a GCP project.

    Go to the project selector page

  3. Make sure that billing is enabled for your Google Cloud Platform project. Learn how to enable billing.

  4. Enable the Cloud Genomics, Compute Engine, and Cloud Storage APIs.

    Enable the APIs

  5. Install and initialize the Cloud SDK.
  6. Update and install gcloud components:
    gcloud components update &&
    gcloud components install alpha
  7. Install git to download the required files.

    Download git

  8. By default, Compute Engine has resource quotas in place to prevent inadvertent usage. By increasing quotas, you can launch more virtual machines concurrently, increasing throughput and reducing turnaround time.

    For best results in this tutorial, you should request additional quota above your project's default. Recommendations for quota increases are provided in the following list, as well as the minimum quotas needed to run the tutorial. Make your quota requests in the us-central1 region:

    • CPUs: 101 (minimum 17)
    • Persistent Disk Standard (GB): 10,500 (minimum 320)
    • In-use IP Addresses: 51 (minimum 2)

    You can leave other quota request fields empty to keep your current quotas.

Create a Cloud Storage bucket

Create a Cloud Storage bucket using the gsutil mb command. Due to a requirement in the Cromwell engine, do not use an underscore (_) character in the bucket name or you will encounter an error.

gsutil mb gs://BUCKET

The pipeline will output results, logs, and intermediate files to this bucket.
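
The underscore restriction above can be checked before you create the bucket. The following is a minimal sketch, not part of the tutorial's tooling: bucket_name_ok is a hypothetical helper name, and it checks only the underscore rule described here, not the full set of Cloud Storage naming rules.

```shell
# Return 0 if the bucket name is non-empty and contains no underscore,
# 1 otherwise. This covers only the Cromwell restriction described above;
# Cloud Storage enforces its own naming rules separately.
bucket_name_ok() {
  case "$1" in
    "" | *_*) return 1 ;;
    *) return 0 ;;
  esac
}

# Example usage (BUCKET is a placeholder for your bucket name):
# if bucket_name_ok "BUCKET"; then gsutil mb "gs://BUCKET"; fi
```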

Download the example files

Download the WDL and helper script:

git clone https://github.com/broadinstitute/wdl-runner.git
git clone https://github.com/gatk-workflows/broad-prod-wgs-germline-snps-indels.git

The gatk-workflows/broad-prod-wgs-germline-snps-indels repository contains the following files needed to run the pipeline:

  • *.wdl: Workflow definition
  • *.inputs.json: Input parameters, including paths to the BAM files and reference genome
  • *.options.json: Workflow runtime options

You can find the Cromwell pipeline definition file used to run WDL pipelines in the wdl_runner/ directory of the broadinstitute/wdl-runner repository.
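
Before running the pipeline, it can help to confirm that the three workflow files are present in the cloned directory. The sketch below assumes the file names that the run command in this tutorial passes to the pipeline (PairedEndSingleSampleWf.*); check_pipeline_files is a hypothetical helper name, and the names may differ if the repository has changed.

```shell
# Report (on stderr) each expected workflow file that is missing from the
# given directory. Returns 0 if all three exist, 1 otherwise.
check_pipeline_files() {
  dir="$1"
  missing=0
  for f in PairedEndSingleSampleWf.wdl \
           PairedEndSingleSampleWf.hg38.inputs.json \
           PairedEndSingleSampleWf.options.json; do
    if [ ! -f "${dir}/${f}" ]; then
      echo "missing: ${dir}/${f}" >&2
      missing=1
    fi
  done
  return "${missing}"
}

# Example usage, run from the directory where you cloned the repositories:
# check_pipeline_files "${PWD}/broad-prod-wgs-germline-snps-indels"
```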

Run the pipeline using sample data

The pipeline runs with WGS data using build 38 of the human reference genome. The input files are unaligned BAM files.

To run the pipeline:

  1. Create the environment variable GATK_GOOGLE_DIR which points to the folder containing the Broad pipeline files:

    export GATK_GOOGLE_DIR="${PWD}"/broad-prod-wgs-germline-snps-indels
    
  2. Create the environment variable GATK_OUTPUT_DIR which points to the Cloud Storage bucket and a folder for the output of the workflow, intermediate workspace files, and logging:

    export GATK_OUTPUT_DIR=gs://BUCKET/FOLDER
    
  3. Change to the wdl_runner folder in the wdl-runner repository you cloned. This directory contains the pipeline definition file for running WDL-based pipelines on Google Cloud:

    cd wdl-runner/wdl_runner/
    
  4. Run the pipeline:

    gcloud alpha genomics pipelines run \
      --pipeline-file wdl_pipeline.yaml \
      --regions us-central1 \
      --inputs-from-file WDL=${GATK_GOOGLE_DIR}/PairedEndSingleSampleWf.wdl,\
    WORKFLOW_INPUTS=${GATK_GOOGLE_DIR}/PairedEndSingleSampleWf.hg38.inputs.json,\
    WORKFLOW_OPTIONS=${GATK_GOOGLE_DIR}/PairedEndSingleSampleWf.options.json \
      --env-vars WORKSPACE=${GATK_OUTPUT_DIR}/work,\
    OUTPUTS=${GATK_OUTPUT_DIR}/output \
      --logging ${GATK_OUTPUT_DIR}/logging/
    
  5. The command returns an operation ID in the format Running [operations/OPERATION_ID]. You can use the operation ID to track the status of the pipeline by running the following command:

    gcloud alpha genomics operations describe OPERATION_ID \
        --format='yaml(done, error, metadata.events)'
    
  6. The operations describe command returns done: true when the pipeline finishes.

    You can run a script included with the wdl_runner to check every 300 seconds whether the job is running, has finished, or returned an error:

    ../monitoring_tools/monitor_wdl_pipeline.sh OPERATION_ID 300
    
  7. After the pipeline finishes, run the following command to list the outputs in your Cloud Storage bucket:

    gsutil ls gs://BUCKET/FOLDER/output/
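
If you script these steps, you need the operation ID as a variable rather than copying it by hand. The following sketch pulls the ID out of a line in the "Running [operations/OPERATION_ID]" format described in step 5. extract_operation_id is a hypothetical helper name, and how you capture the gcloud output (for example, which stream it is written to) is left as an assumption to verify in your environment.

```shell
# Read text on stdin and print the operation ID from any line of the form
# "Running [operations/OPERATION_ID]". Prints nothing if no line matches.
extract_operation_id() {
  sed -n 's|.*\[operations/\([^]]*\)\].*|\1|p'
}

# Example usage with a literal line in place of real gcloud output:
# OPERATION_ID="$(echo 'Running [operations/EXAMPLE_ID].' | extract_operation_id)"
# gcloud alpha genomics operations describe "${OPERATION_ID}" \
#     --format='yaml(done, error, metadata.events)'
```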
    

You can either view the intermediate files created by the pipeline and choose which ones you want to keep, or remove them to reduce costs associated with Cloud Storage. To remove the files, see Deleting intermediate files in your Cloud Storage bucket.

Run the GATK Best Practices pipeline on your data

Before you run the pipeline on your local data, you need to copy the data into a Cloud Storage bucket.

Copy input files

The pipeline runs on unaligned BAM files stored in Cloud Storage. If your files are in a different format, such as aligned BAM or FASTQ, you must convert them to unaligned BAM before running the pipeline. You can convert them locally and then upload them, or you can use the Pipelines API to convert them in the cloud.

The following example shows how to copy a single file from a local filesystem to a Cloud Storage bucket:

gsutil -m -o 'GSUtil:parallel_composite_upload_threshold=150M' cp FILE \
    gs://BUCKET/FOLDER

For more examples of how to copy files to a Cloud Storage bucket, see the section on Copying data into Cloud Storage.

The gsutil command-line tool verifies checksums automatically, so a successful transfer means that your data was copied intact.

Run the pipeline on your data

To run the GATK Best Practices on your own unaligned BAM files, make a copy of PairedEndSingleSampleWf.hg38.inputs.json, then update the paths in the copy to point to your files in a Cloud Storage bucket. You can then follow the steps in Run the pipeline using sample data, passing your updated copy as WORKFLOW_INPUTS in place of the original PairedEndSingleSampleWf.hg38.inputs.json file.
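
To see which paths need updating, you can list every Cloud Storage path that an inputs file references. This is a minimal sketch: list_gcs_paths is a hypothetical helper name, and it simply searches for quoted strings that start with gs://, without parsing the JSON.

```shell
# Print each distinct gs:// path referenced in a JSON inputs file,
# one per line, sorted. Matches run from "gs://" up to the closing quote.
list_gcs_paths() {
  grep -o 'gs://[^"]*' "$1" | sort -u
}

# Example usage (my.inputs.json is a placeholder for your copy):
# cp "${GATK_GOOGLE_DIR}/PairedEndSingleSampleWf.hg38.inputs.json" my.inputs.json
# list_gcs_paths my.inputs.json
```

Paths that point to your sample data are the ones to change; whether the reference genome paths also need changing depends on your data.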

If your data isn't WGS data in unaligned BAM format, for example exome sequencing, targeted panels, or somatic data, or if you use a different reference genome, you will have to use different workflows. See the GATK Support Forum and the Broad Institute GitHub repository for more information.

Troubleshooting

  • The pipeline is configured to use Compute Engine instances in specific regions and zones. When you run the gcloud tool, it automatically uses a default region and zone based on the location where your Google Cloud project was created. This can result in the following error message when running the pipeline:

    "ERROR: (gcloud.alpha.genomics.pipelines.run) INVALID_ARGUMENT: Error: validating pipeline: zones and regions cannot be specified together"

    To solve this issue, remove the default region and zone by running the following commands, and then run the pipeline again:

    gcloud config unset compute/zone
    gcloud config unset compute/region
    

    For additional information on setting the default region and zone for your Google Cloud project, see Changing the default zone or region.

  • If you encounter problems when running the pipeline, see Cloud Life Sciences API troubleshooting.

  • GATK has strict expectations about input file formats. To avoid problems, you can validate that your files pass ValidateSamFile.

  • If your GATK run fails, you can list the log files that the pipeline wrote to the path given by the --logging flag by running the following command:

    gsutil ls gs://BUCKET/FOLDER/logging
    
  • If you encounter permission errors, check that your service account has read access to the input files and write access to the output bucket path. If you're writing output files to a bucket in a Google Cloud project that isn't your own, you'll need to grant the service account permission to access the bucket.

Cleaning up

Deleting intermediate files in your Cloud Storage bucket

When you run the pipeline, it stores intermediate files in the location set by the WORKSPACE variable, which in this tutorial is gs://BUCKET/FOLDER/work. You can remove the files after the workflow completes to reduce Cloud Storage charges.

To view the amount of space used in the work directory:

gsutil du -sh gs://BUCKET/FOLDER/work

To remove all of the intermediate files in the work directory:

gsutil -m rm gs://BUCKET/FOLDER/work/**
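
If GATK_OUTPUT_DIR is still set in your shell, you can derive the intermediate-file location from it instead of retyping the path. The sketch below assumes the run command from this tutorial, which sets WORKSPACE to ${GATK_OUTPUT_DIR}/work; workspace_uri is a hypothetical helper name.

```shell
# Print the intermediate-file location derived from GATK_OUTPUT_DIR.
# Returns 1 without printing a path if the variable is unset or empty,
# so a cleanup command is never built from an empty prefix.
workspace_uri() {
  if [ -z "${GATK_OUTPUT_DIR:-}" ]; then
    echo "GATK_OUTPUT_DIR is not set" >&2
    return 1
  fi
  # Strip one trailing slash, if present, before appending /work.
  echo "${GATK_OUTPUT_DIR%/}/work"
}

# Example usage:
# gsutil du -sh "$(workspace_uri)"
# gsutil -m rm "$(workspace_uri)/**"
```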

Deleting the project

The easiest way to eliminate billing is to delete the project you used for the tutorial.

To delete the project:

  1. In the Cloud Console, go to the Projects page.

    Go to the Projects page

  2. In the project list, select the checkbox next to the project you want to delete, and then click Delete project.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.

What's next

  • The Broad Institute GATK site and forums provide more complete background information, documentation, and support for the GATK tools and WDL.