To access genomic data stored in Google Cloud Storage, you can use Google's implementation of the htsget protocol, which is defined by the Global Alliance for Genomics and Health (GA4GH).
Google's htsget implementation makes it easy to access and share data stored in your own cloud projects without copying large files to and from Compute Engine virtual machines.
You can also use the htsget server to access data from public sources like Google's mirror of the 1000 Genomes Project.
To try it out on some public data, run the following commands in Cloud Shell:
docker network create test
docker run -d --network=test --name=htsget gcr.io/genomics-tools/htsget
These commands create a local Docker network named 'test' and start the htsget server in a container attached to that network. Once the server has started, you can access it using any software that speaks the GA4GH htsget protocol.
As an example, the command below uses samtools to compute statistics for the reads in a small range of chromosome 11 in a public genome:
docker run --network=test gcr.io/genomics-tools/samtools flagstat "http://htsget/reads/genomics-public-data/platinum-genomes/bam/NA12892_S1.bam?referenceName=chr11&end=1000"
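The request URL in the samtools command above appears to have three parts: the server address, a /reads/ path naming the Cloud Storage bucket and object, and query parameters selecting the region. As a rough sketch only, the shell function below assembles a URL of that shape. The function name htsget_reads_url is hypothetical, and the optional start parameter comes from the GA4GH htsget specification rather than from the example above.

```shell
# Hypothetical helper (not part of the htsget server or samtools):
# compose an htsget reads URL for an object stored in Cloud Storage.
# Usage: htsget_reads_url HOST BUCKET OBJECT REFERENCE [START] [END]
htsget_reads_url() {
  host=$1
  bucket=$2
  object=$3
  ref=$4
  start=${5:-}
  end=${6:-}

  url="http://${host}/reads/${bucket}/${object}?referenceName=${ref}"

  # start and end are optional; append each only when it is non-empty.
  if [ -n "$start" ]; then url="${url}&start=${start}"; fi
  if [ -n "$end" ]; then url="${url}&end=${end}"; fi

  printf '%s\n' "$url"
}

# Rebuild the URL used in the samtools command (no start, end=1000).
htsget_reads_url htsget genomics-public-data \
  platinum-genomes/bam/NA12892_S1.bam chr11 "" 1000
```

The host name htsget resolves only inside the 'test' Docker network, because it is the container name given with --name when the server was started.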
Within a few seconds, the samtools command should report that it processed just over 1500 reads, all streamed from the BAM file stored in Google Cloud Storage:
1532 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
5 + 0 duplicates
1526 + 0 mapped (99.61% : N/A)
1532 + 0 paired in sequencing
784 + 0 read1
748 + 0 read2
1510 + 0 properly paired (98.56% : N/A)
1520 + 0 with itself and mate mapped
6 + 0 singletons (0.39% : N/A)
10 + 0 with mate mapped to a different chr
1 + 0 with mate mapped to a different chr (mapQ>=5)
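The percentages in this report are computed against the total read count, so the mapped figure can be checked by hand: 1526 of 1532 reads is about 99.61%. As a minimal sketch, the shell function below recomputes that value from flagstat text on standard input. The function name mapped_pct is hypothetical, and the field positions it relies on are assumed from the output shown above.

```shell
# Hypothetical helper: recompute the mapped percentage from
# `samtools flagstat` text read on standard input.
mapped_pct() {
  awk '
    $4 == "in" && $5 == "total" { total = $1 }    # QC-passed read count
    $4 == "mapped"              { mapped = $1 }   # QC-passed mapped reads
    END { if (total > 0) printf "%.2f\n", 100 * mapped / total }
  '
}

# Feed in the two relevant lines from the report above.
printf '%s\n' \
  '1532 + 0 in total (QC-passed reads + QC-failed reads)' \
  '1526 + 0 mapped (99.61% : N/A)' | mapped_pct
# prints 99.61
```

In practice, the output of the samtools container could be piped straight into mapped_pct instead of the two sample lines.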
For more information about the htsget server, including information on accessing private data and limiting access to your data, see the htsget README.