This page shows you how to set up and start using Google Genomics.

Before you begin

  1. If you don't already have one, sign up for a Google Account.
  2. Sign in to your Google Account.

  3. In the Cloud Platform Console, go to the Manage resources page and create a new project.

  4. Enable billing for your project.

  5. Enable the Genomics, BigQuery, and Cloud Storage APIs.

Launch Cloud Shell to use the command line

You can use Cloud Shell to access the Google Cloud SDK, which includes tools and libraries that you need to create and manage resources on Google Cloud Platform, including Google Genomics, Google Compute Engine, Google Cloud Storage, and BigQuery.

To launch Cloud Shell:

  1. Navigate to the project you want to use in the GCP Console.

  2. Click the Activate Google Cloud Shell button at the top of the console window.

    A Cloud Shell session opens inside a new frame at the bottom of the console and displays a command-line prompt.
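
Once the prompt appears, you may want to confirm which project the session is using before you run anything that creates resources. The following is a minimal sketch that uses the gcloud tool from the Cloud SDK. The function names active_project and use_project are hypothetical helpers introduced here for illustration, and the project ID is whatever you pass in.

```shell
# Print the project that gcloud commands in this session act on.
active_project() {
  gcloud config get-value project 2>/dev/null
}

# Point the session at a different project.
# "$1" is a project ID that you supply; no particular ID is assumed.
use_project() {
  gcloud config set project "$1"
}
```

For example, running use_project with your own project ID and then active_project should print the ID you just set.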

Querying reads with htsget

To access genomic data stored in Google Cloud Storage, you can use Google's implementation of the htsget protocol defined by the Global Alliance for Genomics and Health.

Google's htsget implementation makes it easy to access and share data stored in your own cloud projects without copying large files to and from Compute Engine virtual machines.

You can also use the htsget server to access data from public sources, such as Google's mirror of the 1000 Genomes Project.

To try it on public data, run these commands in the Cloud Shell session that you started from your project:

docker network create test
docker run -d --network=test --name=htsget gcr.io/genomics-tools/htsget

These commands create a local Docker container network named 'test', then start the htsget server in the background and attach it to that network. Once the server has started, you can access it with any software that speaks the GA4GH htsget protocol.

As an example, the command below uses samtools flagstat to report statistics for the reads in a small range at the start of chromosome 11 of a public genome:

docker run --network=test gcr.io/genomics-tools/samtools flagstat "http://htsget/reads/genomics-public-data/platinum-genomes/bam/NA12892_S1.bam?referenceName=chr11&end=1000"

Within a few seconds, samtools should report that it processed just over 1,500 reads streamed from the BAM file stored in Google Cloud Storage:

1532 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
5 + 0 duplicates
1526 + 0 mapped (99.61% : N/A)
1532 + 0 paired in sequencing
784 + 0 read1
748 + 0 read2
1510 + 0 properly paired (98.56% : N/A)
1520 + 0 with itself and mate mapped
6 + 0 singletons (0.39% : N/A)
10 + 0 with mate mapped to a different chr
1 + 0 with mate mapped to a different chr (mapQ>=5)
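
The URL passed to samtools above follows the request pattern in the GA4GH htsget specification: a /reads/ path that identifies the file, plus query parameters that select a region. As a sketch of how such a URL is assembled, the hypothetical helper below builds one from a file path and a region (htsget_reads_url is not part of any tool). It assumes the server is reachable under the host name htsget, as in the commands above, and that the path after /reads/ is the bucket name followed by the object name, as the example URL suggests.

```shell
# Build an htsget reads URL (hypothetical helper, for illustration only).
#   $1  bucket and object path, e.g. my-bucket/path/to/sample.bam
#   $2  reference sequence name, e.g. chr11
#   $3  optional start position (0-based, inclusive per the htsget spec)
#   $4  optional end position (0-based, exclusive per the htsget spec)
htsget_reads_url() {
  url="http://htsget/reads/$1?referenceName=$2"
  # Append the range parameters only when they are provided.
  if [ -n "$3" ]; then
    url="${url}&start=$3"
  fi
  if [ -n "$4" ]; then
    url="${url}&end=$4"
  fi
  printf '%s\n' "$url"
}
```

Calling htsget_reads_url genomics-public-data/platinum-genomes/bam/NA12892_S1.bam chr11 "" 1000 reproduces the URL used in the samtools command above. Quote the result when you pass it to another command, because the & characters are otherwise interpreted by the shell.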

For more information about the htsget server, including information on accessing private data and limiting access to your data, see the htsget README.
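
When you finish experimenting, you can stop the server and remove the network that you created earlier. The following is a minimal sketch that uses standard Docker commands, wrapped in a hypothetical function named cleanup_htsget so that nothing runs until you call it. It assumes the container name htsget and the network name test from the commands above.

```shell
# Stop and remove the htsget container, then remove the 'test' network.
# Hypothetical helper; the names match the docker commands shown earlier.
cleanup_htsget() {
  docker stop htsget
  docker rm htsget
  docker network rm test
}
```

Run cleanup_htsget at the Cloud Shell prompt once you no longer need the server.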

What's next