Running dsub

dsub is a command-line tool that you can use to run batch computing tasks and workflows on Google Cloud.

Objectives

After completing this tutorial, you'll know how to:

  • Run a dsub pipeline on Google Cloud that creates an index (BAI file) from a large binary file of DNA sequences (BAM file)

Costs

This tutorial uses billable components of Google Cloud, including:

  • Compute Engine
  • Cloud Storage

Use the Pricing Calculator to generate a cost estimate based on your projected usage. New Cloud Platform users might be eligible for a free trial.

Before you begin

  1. Install Python 2.7+. For more information on setting up your Python development environment, such as installing pip on your system, see the Python Development Environment Setup Guide.
  2. Sign in to your Google Account.

    If you don't already have one, sign up for a new account.

  3. On the project selector page in the GCP Console, select or create a GCP project.

    Go to the project selector page

  4. Make sure that billing is enabled for your Google Cloud Platform project. Learn how to enable billing.

  5. Enable the Google Genomics, Compute Engine, and Cloud Storage APIs.

    Enable the APIs
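
Before moving on, it can help to confirm that the command-line tools used later in this tutorial are on your PATH. The following is a minimal sketch, not part of dsub or the Cloud SDK: the helper name check_tool is made up for illustration, and the tool list (python, git, gsutil) is an assumption based on the commands that appear in this tutorial.

```shell
#!/bin/sh
# check_tool is a hypothetical helper (not part of dsub or the Cloud SDK).
# It prints whether a command is available on PATH and returns non-zero
# when the command is missing.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "found: $1"
  else
    echo "missing: $1"
    return 1
  fi
}

# Tools that the commands in this tutorial rely on.
# "|| true" keeps the loop going when a tool is missing.
for tool in python git gsutil; do
  check_tool "$tool" || true
done
```

Any line that starts with "missing:" points to a tool to install before you continue.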

Create a BAI file

Complete the following steps to create an index (BAI file) from a large binary file of DNA sequences (BAM file). The data comes from the 1000 Genomes Project.

  1. Clone the GitHub repository googlegenomics/dsub, then change to the directory for the dsub tool. The indexing itself is done by samtools, which runs from the pre-built Docker image named by the --image flag in step 3.

    git clone https://github.com/googlegenomics/dsub.git
    cd dsub
    
  2. Install dsub and its dependencies:

    python setup.py install
    
  3. Run the dsub tool to create the BAI file, replacing PROJECT_ID with your Google Cloud project and BUCKET with a Cloud Storage bucket to which you have write access:

    dsub \
        --project PROJECT_ID \
        --zones "us-*" \
        --logging gs://BUCKET/logs \
        --input BAM=gs://genomics-public-data/1000-genomes/bam/HG00114.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam \
        --output BAI=gs://BUCKET/HG00114.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam.bai \
        --image quay.io/cancercollaboratory/dockstore-tool-samtools-index \
        --command 'samtools index ${BAM} ${BAI}' \
        --wait
    

    The samtools command runs on the data file provided with the --input flag. The pipeline writes the output file and logs to your Cloud Storage bucket.

    If you have multiple inputs, you can specify them using multiple --input flags. The inputs can be specified in any order. The following sample shows how to specify two inputs:

    ...
    --input INPUT_FILE_1=gs://PATH/TO/INPUT_FILE_1 \
    --input INPUT_FILE_2=gs://PATH/TO/INPUT_FILE_2 \
    ...
    
  4. Verify that the BAI file was generated:

    gsutil ls gs://BUCKET
    

    The command should return the following:

    gs://BUCKET/HG00114.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam.bai
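
One way to avoid retyping the long file names in step 3 is to derive the output path from the input path. The sketch below is an illustration only: the helper names bai_output_path and print_index_command are made up for this example and are not part of dsub. It prints the dsub command from step 3 for review and does not submit anything.

```shell
#!/bin/sh
# Both functions are hypothetical helpers for this tutorial, not dsub features.

# Usage: bai_output_path BUCKET BAM_URL
# Prints gs://BUCKET/<BAM file name>.bai
bai_output_path() {
  bucket=$1
  bam_url=$2
  # ${bam_url##*/} strips everything up to the last slash.
  printf 'gs://%s/%s.bai\n' "$bucket" "${bam_url##*/}"
}

# Usage: print_index_command PROJECT_ID BUCKET BAM_URL
# Prints the dsub invocation from step 3 with the placeholders filled in.
print_index_command() {
  project=$1
  bucket=$2
  bam_url=$3
  bai_url=$(bai_output_path "$bucket" "$bam_url")
  cat <<EOF
dsub \\
    --project $project \\
    --zones "us-*" \\
    --logging gs://$bucket/logs \\
    --input BAM=$bam_url \\
    --output BAI=$bai_url \\
    --image quay.io/cancercollaboratory/dockstore-tool-samtools-index \\
    --command 'samtools index \${BAM} \${BAI}' \\
    --wait
EOF
}

# Print the command for the tutorial's sample file, keeping the placeholders.
print_index_command PROJECT_ID BUCKET \
  gs://genomics-public-data/1000-genomes/bam/HG00114.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam
```

Printing the command first lets you check the derived --output path before you run anything that incurs cost.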
    
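The check in step 4 can also be scripted so that it succeeds only when the expected object is present. This is a rough sketch under stated assumptions: listing_contains is a hypothetical helper, and the listing below is canned sample text (including the logs/ line) so that the example runs without access to Cloud Storage.

```shell
#!/bin/sh
# listing_contains is a hypothetical helper. It reads an object listing
# (one URL per line, as printed by "gsutil ls") on standard input and
# succeeds only if one line matches the expected URL exactly.
listing_contains() {
  grep -qxF -- "$1"
}

expected="gs://BUCKET/HG00114.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam.bai"

# Canned listing for illustration. With real data, pipe in the output of
# "gsutil ls gs://BUCKET" instead of the printf command.
if printf '%s\n' "gs://BUCKET/logs/" "$expected" | listing_contains "$expected"; then
  echo "BAI file found"
else
  echo "BAI file not found"
fi
```

The -x and -F options make grep compare whole lines as fixed strings, so a partial match such as a temporary file with a longer name does not count.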

Cleaning up

After you've finished the tutorial, you can clean up the resources you created on Google Cloud so you won't be billed for them in the future. The following sections describe how to delete or turn off these resources.

Deleting the project

The easiest way to eliminate billing is to delete the project you used for the tutorial.

To delete the project:

  1. In the Cloud Console, go to the Projects page.

    Go to the Projects page

  2. In the project list, select the checkbox next to the project you want to delete, and then click Delete project.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.
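
If you prefer the command line, the Cloud SDK command gcloud projects delete also shuts down a project. The sketch below wraps it in a check that mirrors the confirmation dialog in step 3. The helper name print_project_delete is hypothetical, and it only prints the command for you to review; it does not run it.

```shell
#!/bin/sh
# print_project_delete is a hypothetical helper, not a Cloud SDK command.
# Usage: print_project_delete PROJECT_ID TYPED_ID
# It prints the delete command only when the ID typed for confirmation
# is non-empty and matches the project ID exactly.
print_project_delete() {
  if [ -n "$1" ] && [ "$1" = "$2" ]; then
    echo "gcloud projects delete $1"
  else
    echo "project ID not confirmed; no command printed" >&2
    return 1
  fi
}

# Matching IDs: prints the command to review and run by hand.
print_project_delete PROJECT_ID PROJECT_ID
```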

What's next

Read the dsub documentation on GitHub for more details and examples of how to use dsub with genomics data.

Cloud Life Sciences Documentation