Read data using htsget

This page describes how to use the Google implementation of the htsget protocol to do the following tasks:

  • Read data stored in Cloud Storage.
  • Read data from public sources such as Google's mirror of the 1000 Genomes Project.

The htsget protocol is defined by the Global Alliance for Genomics and Health (GA4GH).

Google's htsget implementation lets you access and share data stored in your own cloud projects without copying large files to and from Compute Engine virtual machines.

Read public data

To start the htsget server, run the following commands:

docker network create test
docker run -d --network=test --name=htsget gcr.io/cloud-lifesciences/htsget

These commands create a local Docker network named "test" and start the htsget server on that network in a container named "htsget". After the server starts, you can access it using any software that communicates using the GA4GH htsget protocol.

Run the following command to view read statistics for a small range of chromosome 11 in a public genome:

docker run \
    --network=test gcr.io/cloud-lifesciences/samtools \
    flagstat "http://htsget/reads/genomics-public-data/platinum-genomes/bam/NA12892_S1.bam?referenceName=chr11&end=1000"

After a few seconds, the command prints statistics for about 1,500 reads that were streamed from a BAM file stored in Cloud Storage:

1532 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
5 + 0 duplicates
1526 + 0 mapped (99.61% : N/A)
1532 + 0 paired in sequencing
784 + 0 read1
748 + 0 read2
1510 + 0 properly paired (98.56% : N/A)
1520 + 0 with itself and mate mapped
6 + 0 singletons (0.39% : N/A)
10 + 0 with mate mapped to a different chr
1 + 0 with mate mapped to a different chr (mapQ>=5)
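The URL passed to samtools in the command above appears to follow the pattern http://htsget/reads/BUCKET/OBJECT?referenceName=NAME&end=END, where "htsget" is the container name given to the server. The following is a minimal sketch of a helper that assembles such a URL. The function name htsget_reads_url is hypothetical and is not part of the htsget server or samtools, and the pattern is inferred only from the example on this page.

```shell
#!/bin/sh
# Hypothetical helper (not part of htsget or samtools): build a reads URL
# for the local htsget server from the pattern used in the example above.
#   $1  Cloud Storage path in the form BUCKET/OBJECT
#   $2  reference name, for example chr11
#   $3  end coordinate of the range
htsget_reads_url() {
    object_path="$1"
    reference_name="$2"
    range_end="$3"
    # "htsget" is the host name the server container gets on the "test" network.
    printf 'http://htsget/reads/%s?referenceName=%s&end=%s\n' \
        "$object_path" "$reference_name" "$range_end"
}

# Rebuild the URL used in the flagstat example.
htsget_reads_url "genomics-public-data/platinum-genomes/bam/NA12892_S1.bam" chr11 1000
```

Quote the resulting URL when you pass it to samtools so that the shell does not interpret the "&" character.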

For more information about the htsget server, including information on accessing private data and limiting access to your data, see the htsget README.