Reading data

To access genomic data stored in Cloud Storage, you can use Google's implementation of the htsget protocol, which is defined by the Global Alliance for Genomics and Health (GA4GH).

Google's htsget implementation lets you access and share data stored in your own cloud projects without copying large files to and from Compute Engine virtual machines.

You can also use the htsget server to access data from public sources like Google's mirror of the 1000 Genomes Project.

To try it out on some public data, run the following commands in Cloud Shell:

docker network create test
docker run -d --network=test --name=htsget

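These commands create a local Docker network named 'test' and start the htsget server in a container named 'htsget' attached to that network. Other containers on the same network can reach the server by that name. Once the server has started, you can access it using any software that speaks the GA4GH htsget protocol.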
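An htsget reads request is an ordinary HTTP GET: the path names the data under `/reads/`, and optional `referenceName`, `start`, and `end` query parameters select a region. The sketch below builds such a URL. The function name `htsget_reads_url` is hypothetical, the bucket-and-object form of the ID is inferred from the example URL on this page, and the values are assumed to be URL-safe already (no percent-encoding is performed).

```shell
#!/usr/bin/env bash
# Hypothetical helper: build an htsget reads URL.
# Usage: htsget_reads_url BASE ID REFERENCE_NAME [START] [END]
#   BASE            server base URL, e.g. http://htsget
#   ID              assumed here to be <bucket>/<object path>
#   REFERENCE_NAME  reference sequence name, e.g. chr11
#   START, END      optional; the htsget specification defines these as
#                   0-based, with START inclusive and END exclusive
htsget_reads_url() {
  local base=$1 id=$2 ref=$3 start=${4:-} end=${5:-}
  # Strip one trailing slash from BASE so the path has no double slash.
  local url="${base%/}/reads/${id}?referenceName=${ref}"
  if [ -n "$start" ]; then
    url="${url}&start=${start}"
  fi
  if [ -n "$end" ]; then
    url="${url}&end=${end}"
  fi
  printf '%s\n' "$url"
}

# Example: the first 1000 bases of chr11, with no explicit start.
htsget_reads_url http://htsget \
  genomics-public-data/platinum-genomes/bam/NA12892_S1.bam chr11 "" 1000
```

Passing an empty string for START omits that parameter, so the request covers the reference from its beginning up to END.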

As an example, the command below uses samtools to view statistics about a small range at the start of chromosome 11 of a public genome:

docker run --network=test flagstat "http://htsget/reads/genomics-public-data/platinum-genomes/bam/NA12892_S1.bam?referenceName=chr11&end=1000"

Within a few seconds, samtools should report that it processed just over 1,500 reads streamed from the BAM file stored in Cloud Storage:

1532 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
5 + 0 duplicates
1526 + 0 mapped (99.61% : N/A)
1532 + 0 paired in sequencing
784 + 0 read1
748 + 0 read2
1510 + 0 properly paired (98.56% : N/A)
1520 + 0 with itself and mate mapped
6 + 0 singletons (0.39% : N/A)
10 + 0 with mate mapped to a different chr
1 + 0 with mate mapped to a different chr (mapQ>=5)
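The percentages in this report are ratios against the total read count, which makes the output easy to sanity-check. The sketch below recomputes the mapped percentage from flagstat text with awk. The function name `flagstat_mapped_pct` is hypothetical, and the field positions are an assumption based on the output format shown above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read samtools flagstat text on stdin and print
# the mapped percentage to two decimal places.
flagstat_mapped_pct() {
  awk '
    # "<passed> + <failed> in total (...)": field 1 is the QC-passed count.
    $4 == "in" && $5 == "total" { total = $1 }
    # "<passed> + <failed> mapped (...)": match on field 4 so that lines
    # such as "with itself and mate mapped" are not picked up.
    $4 == "mapped"              { mapped = $1 }
    END { if (total > 0) printf "%.2f\n", 100 * mapped / total }
  '
}

# Example using two lines from the report above.
flagstat_mapped_pct <<'EOF'
1532 + 0 in total (QC-passed reads + QC-failed reads)
1526 + 0 mapped (99.61% : N/A)
EOF
# prints 99.61
```

The same arithmetic applies to the other percentages: 1510 properly paired reads out of 1532 is 98.56%, and 6 singletons out of 1532 is 0.39%.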

For more information about the htsget server, including information on accessing private data and limiting access to your data, see the htsget README.