This page shows you how to run a genomics pipeline that uses the Cloud Life Sciences API to create an index file (BAI file) from a large binary file containing DNA sequences (BAM file).
Before you begin
- Sign in to your Google Account. If you don't already have one, sign up for a new account.
- In the Google Cloud Console, on the project selector page, select or create a Google Cloud project.
- Make sure that billing is enabled for your Cloud project. Learn how to confirm that billing is enabled for your project.
- Enable the Cloud Life Sciences, Compute Engine, and Cloud Storage JSON APIs.
- Install and initialize the Cloud SDK.
Alternatively, you can use Cloud Shell, which comes with the Cloud SDK already installed.
Run the pipeline
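You can run the pipeline from a Linux or macOS shell (the curl section) or from Windows PowerShell. Both sections use the gcloud and gsutil command-line tools.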
curl
Create a BUCKET environment variable. The variable points to a Cloud Storage bucket named after your project ID with -life-sciences appended. Replace PROJECT_ID with your project ID.
export BUCKET=gs://PROJECT_ID-life-sciences
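If you prefer not to type the project ID by hand, the bucket name can be derived from it. The following is a minimal sketch: bucket_for_project is a hypothetical helper name, not part of the Cloud SDK, and the commented usage line assumes that an active project is already configured in gcloud.

```shell
# Hypothetical helper: build the quickstart bucket URL from a project ID.
bucket_for_project() {
  if [ -z "$1" ]; then
    echo "usage: bucket_for_project PROJECT_ID" >&2
    return 1
  fi
  printf 'gs://%s-life-sciences\n' "$1"
}

# Example usage (assumes gcloud has an active project configured):
# export BUCKET="$(bucket_for_project "$(gcloud config get-value project)")"
```

The helper only formats a string; it does not check that the project or the bucket exists.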
Create the bucket using the gsutil mb command:
gsutil mb ${BUCKET}
Run a pipeline using the gcloud command-line tool, specifying a BAM file name for the input and a BAI file name for the output. The pipeline invokes the Cloud Life Sciences API, creates a Compute Engine VM instance, and then runs the pipeline process on the instance. After the process finishes, the instance is automatically shut down and the BAI file is copied to your Cloud Storage bucket.
gcloud beta lifesciences pipelines run \
    --regions us-east1 \
    --command-line 'samtools index ${BAM} ${BAI}' \
    --docker-image "gcr.io/genomics-tools/samtools" \
    --inputs BAM=gs://genomics-public-data/NA12878.chr20.sample.bam \
    --outputs BAI=${BUCKET}/NA12878.chr20.sample.bam.bai
If successful, the command returns the following:
Running [projects/PROJECT_ID/operations/OPERATION_ID]
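The --outputs value in the command above follows a simple convention: the BAM file name with .bai appended, placed in your bucket. If you want to index a different BAM file, that convention can be sketched as a small function. The name bai_output_for is hypothetical and used only for illustration.

```shell
# Hypothetical helper: name the index after the input BAM file and place it
# in the given bucket, following the convention of the --outputs flag above.
bai_output_for() {
  bam_uri="$1"   # for example gs://genomics-public-data/NA12878.chr20.sample.bam
  bucket="$2"    # for example the value of ${BUCKET}
  # ${bucket%/} drops one trailing slash; ${bam_uri##*/} keeps only the file name.
  printf '%s/%s.bai\n' "${bucket%/}" "${bam_uri##*/}"
}

# Example usage:
# bai_output_for gs://genomics-public-data/NA12878.chr20.sample.bam "${BUCKET}"
```

You could then pass the result to the pipeline as --outputs BAI="$(bai_output_for "${BAM_URI}" "${BUCKET}")", where BAM_URI is an assumed variable holding the input path.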
The pipeline takes a few minutes to finish. You can run the following command to track its status. Replace OPERATION_ID with the value printed in the previous step.
gcloud beta lifesciences operations wait OPERATION_ID
After the operation finishes, it returns the following message:
Waiting for [projects/PROJECT_ID/operations/OPERATION_ID]...done.
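Instead of copying OPERATION_ID by hand, you can extract it from the Running [...] line. The sketch below assumes that the line has the bracketed form shown earlier and that you have captured it in a shell variable yourself; operation_id_from_line is a hypothetical helper, not a gcloud feature.

```shell
# Hypothetical helper: pull the operation ID out of a line such as
#   Running [projects/PROJECT_ID/operations/OPERATION_ID]
operation_id_from_line() {
  line="$1"
  case "${line}" in
    *operations/*) ;;          # the line contains an operation name
    *) return 1 ;;             # anything else is rejected
  esac
  id="${line##*operations/}"   # drop everything through the last "operations/"
  id="${id%%]*}"               # drop the closing bracket and anything after it
  printf '%s\n' "${id}"
}

# Example usage (RUN_LINE is an assumed variable holding the captured line):
# gcloud beta lifesciences operations wait "$(operation_id_from_line "${RUN_LINE}")"
```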
Verify that the BAI file was generated:
gsutil ls ${BUCKET}
The command should return the following:
gs://BUCKET/NA12878.chr20.sample.bam.bai
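In a script, you may want to check for the BAI file rather than read the listing by eye. A minimal sketch, assuming the listing prints one object URL per line as shown above; listing_has_object is a hypothetical helper name.

```shell
# Hypothetical helper: succeed if standard input contains a line that is
# exactly equal to the object URL passed as the first argument.
listing_has_object() {
  # -F fixed string, -x whole-line match, -q no output (exit status only)
  grep -Fxq -- "$1"
}

# Example usage:
# gsutil ls "${BUCKET}" | listing_has_object "${BUCKET}/NA12878.chr20.sample.bam.bai" \
#   && echo "BAI file found"
```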
You've just run a pipeline using the Cloud Life Sciences API to create a BAI file from a BAM file.
PowerShell
Create a BUCKET variable. The variable points to a Cloud Storage bucket named after your project ID with -life-sciences appended. Replace PROJECT_ID with your project ID.
$BUCKET = "gs://PROJECT_ID-life-sciences"
Create the bucket using the gsutil mb command:
gsutil mb ${BUCKET}
Run a pipeline using the gcloud command-line tool, specifying a BAM file name for the input and a BAI file name for the output. The pipeline invokes the Cloud Life Sciences API, creates a Compute Engine VM instance, and then runs the pipeline process on the instance. After the process finishes, the instance is automatically shut down and the BAI file is copied to your Cloud Storage bucket.
gcloud beta lifesciences pipelines run `
    --regions us-east1 `
    --command-line 'samtools index ${BAM} ${BAI}' `
    --docker-image "gcr.io/genomics-tools/samtools" `
    --inputs BAM=gs://genomics-public-data/NA12878.chr20.sample.bam `
    --outputs BAI=${BUCKET}/NA12878.chr20.sample.bam.bai
If successful, the command returns the following:
Running [projects/PROJECT_ID/operations/OPERATION_ID]
The pipeline takes a few minutes to finish. You can run the following command to track its status. Replace OPERATION_ID with the value printed in the previous step.
gcloud beta lifesciences operations wait OPERATION_ID
After the operation finishes, it returns the following message:
Waiting for [projects/PROJECT_ID/operations/OPERATION_ID]...done.
Verify that the BAI file was generated:
gsutil ls ${BUCKET}
The command should return the following:
gs://BUCKET/NA12878.chr20.sample.bam.bai
You've just run a pipeline using the Cloud Life Sciences API to create a BAI file from a BAM file.
Clean up
To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, you can clean up the resources you created on Google Cloud. The following sections describe how to delete or turn off these resources.
Delete the project
If you created the project specifically for this quickstart and no longer need it, you can delete the project. Deleting the project also deletes the Cloud Storage bucket and the BAI file.
- In the Cloud Console, go to the Manage resources page.
- In the project list, select the project that you want to delete, and then click Delete.
- In the dialog, type the project ID, and then click Shut down to delete the project.
Delete the BAI file
To delete the generated BAI file but keep the project and bucket you created, run the gsutil rm command:
gsutil rm ${BUCKET}/NA12878.chr20.sample.bam.bai
Delete the bucket
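If you created the bucket specifically for this quickstart and no longer need it, but want to keep your project, delete the bucket using the gsutil rb command. The gsutil rb command removes only empty buckets, so delete the generated BAI file first, as described in the previous section.
gsutil rb ${BUCKET}
Alternatively, the gsutil rm -r command deletes the bucket together with everything in it, including the generated BAI file:
gsutil rm -r ${BUCKET}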
What's next
- Find public genome datasets
- Load variant data into Cloud Storage or BigQuery
- Analyze variants with BigQuery