Quickstart

This page shows you how to run a pipeline that uses the Cloud Genomics Pipelines API to create an index file (BAI file) from a large binary file containing DNA sequences (BAM file).

Before you begin

  1. Sign in to your Google Account.

    If you don't already have one, sign up for a new account.

  2. In the GCP Console, go to the Manage resources page and select a project or create a new one.

  3. Make sure that billing is enabled for your project.

  4. Enable the Cloud Genomics, Compute Engine, and Cloud Storage JSON APIs.

  5. Install and initialize the Cloud SDK.

    Alternatively, use Google Cloud Shell, which comes with the Cloud SDK already installed.
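
Before moving on, it can help to confirm that the tools the later steps call are actually on your PATH. The following is a minimal sketch, not part of the official setup: require_tools is a hypothetical helper name, and it relies only on the standard command -v shell builtin.

```shell
# Hypothetical helper: report which of the given commands are missing from PATH.
# Returns 0 when every tool is found, 1 otherwise.
require_tools() {
    local missing=0
    local tool
    for tool in "$@"; do
        if ! command -v "${tool}" >/dev/null 2>&1; then
            echo "Missing required tool: ${tool}" >&2
            missing=1
        fi
    done
    return "${missing}"
}
```

For this quickstart, you would call it as require_tools gcloud gsutil, since both commands are used in the steps that follow.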

Run the pipeline

  1. Create a BUCKET environment variable that points to a Cloud Storage bucket named after your project ID with -genomics appended. Replace PROJECT_ID with your project ID.

    export BUCKET=gs://PROJECT_ID-genomics
    

  2. Create the bucket using the gsutil mb command:

    gsutil mb ${BUCKET}
    

  3. Run a pipeline using the gcloud command-line tool, providing a BAM file as the input and a BAI file as the output. The pipeline invokes the Pipelines API, creates a Compute Engine VM instance, and then runs the pipeline process on the instance. After the process finishes, the instance is automatically shut down and the BAI file is copied to your Cloud Storage bucket.

    gcloud alpha genomics pipelines run \
        --regions us-east1 \
        --command-line 'samtools index ${BAM} ${BAI}' \
        --docker-image "gcr.io/genomics-tools/samtools" \
        --inputs BAM=gs://genomics-public-data/NA12878.chr20.sample.bam \
        --outputs BAI=${BUCKET}/NA12878.chr20.sample.bam.bai
    

    If successful, the command returns the following:

    Running [projects/PROJECT_ID/operations/OPERATION_ID]
    

  4. The pipeline takes a few minutes to finish. Run the following bash loop to check every 30 seconds whether it has finished, replacing OPERATION_ID with the value returned in the previous step.

    while [[ $(gcloud --format='value(done)' alpha genomics operations describe OPERATION_ID) != True ]]; do
        echo "Pipeline not finished, sleeping for 30 seconds..."
        sleep 30
    done
    

    After the loop exits, run the following command to confirm that the pipeline finished. The command outputs False while the pipeline is running and True when it has finished.

    gcloud --format="value(done)" alpha genomics operations describe OPERATION_ID
    

  5. Verify that the BAI file was generated:

    gsutil ls ${BUCKET}
    

    The command should return the following:

    gs://PROJECT_ID-genomics/NA12878.chr20.sample.bam.bai
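
Step 1 builds the bucket name by substituting PROJECT_ID by hand. As a sketch of one way to avoid that substitution, the hypothetical helper below (genomics_bucket is not part of the quickstart) prints the bucket URL for a given project ID and, when no argument is passed, falls back to the project configured in gcloud, assuming one has been set.

```shell
# Hypothetical helper: print the quickstart bucket URL for a project ID.
# With no argument, fall back to the project set in the active gcloud config.
genomics_bucket() {
    local project_id="${1:-$(gcloud config get-value project 2>/dev/null)}"
    if [ -z "${project_id}" ]; then
        echo "No project ID given and none set in gcloud config" >&2
        return 1
    fi
    # Same naming scheme as step 1: the project ID with -genomics appended.
    echo "gs://${project_id}-genomics"
}
```

With this in place, step 1 could be written as export BUCKET="$(genomics_bucket)".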
    

You've just run a pipeline using the Pipelines API to create a BAI file from a BAM file.
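
The polling loop in step 4 runs until the operation reports that it is done, so it never exits if that does not happen. A minimal sketch of the same loop with an upper bound on the wait is shown below. The function name wait_for_operation, the timeout parameter, and its 30-minute default are assumptions for illustration; the gcloud invocation is the one used in step 4.

```shell
# Hypothetical helper: poll an operation until it is done, or give up after a timeout.
# Usage: wait_for_operation OPERATION_ID [TIMEOUT_SECONDS] [INTERVAL_SECONDS]
wait_for_operation() {
    local operation_id="$1"
    local timeout_seconds="${2:-1800}"   # assumed default: 30 minutes
    local interval_seconds="${3:-30}"    # same interval as the loop in step 4
    local waited=0

    while [ "$(gcloud --format='value(done)' alpha genomics operations describe "${operation_id}")" != "True" ]; do
        if [ "${waited}" -ge "${timeout_seconds}" ]; then
            echo "Timed out after ${waited} seconds waiting for ${operation_id}" >&2
            return 1
        fi
        echo "Pipeline not finished, sleeping for ${interval_seconds} seconds..."
        sleep "${interval_seconds}"
        waited=$(( waited + interval_seconds ))
    done
    echo "Pipeline finished."
}
```

Because the function returns a nonzero status on timeout, a calling script can stop before the verification step instead of listing a bucket that does not yet contain the BAI file.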

Clean up

  1. Use the gsutil rm command to delete the BAI file:

    gsutil rm ${BUCKET}/NA12878.chr20.sample.bam.bai
    

  2. If you created the bucket specifically for this quickstart and no longer need it, delete it using the gsutil rb command:

    gsutil rb ${BUCKET}
    
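
The two cleanup steps can also be combined into one function. This is a sketch only: cleanup_quickstart and its yes/no second argument are hypothetical, and it uses just the gsutil rm and gsutil rb commands shown above.

```shell
# Hypothetical helper: delete the quickstart's BAI file, and the bucket too
# when the second argument is "yes".
# Usage: cleanup_quickstart BUCKET_URL [yes|no]
cleanup_quickstart() {
    local bucket="$1"
    local delete_bucket="${2:-no}"

    # Stop if the object cannot be deleted, so the bucket is left alone.
    gsutil rm "${bucket}/NA12878.chr20.sample.bam.bai" || return 1

    if [ "${delete_bucket}" = "yes" ]; then
        # gsutil rb only removes empty buckets, so this fails if other objects remain.
        gsutil rb "${bucket}"
    fi
}
```

For example, cleanup_quickstart "${BUCKET}" yes removes the file and then the bucket, while omitting the second argument keeps the bucket.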
