Loading Genomic Variants

Google Cloud provides multiple ways to store and work with genomic variants. This tutorial walks you through the first steps:

  1. Store raw variant files, such as VCFs and Complete Genomics masterVar files, in Google Cloud Storage
  2. Import variants, for API access, into Google Genomics
  3. Export variants, for ad hoc queries, to Google BigQuery

Before you begin

  1. Complete the Quickstart.
  2. To get the command-line client, install the Cloud SDK and Genomics commands.
  3. Enable billing for your project:
    1. Open the billing page for the project you have selected or created.
    2. Click Enable billing.

    Note: Enabling billing does not necessarily mean you will be charged. See Pricing for more information.

Step 1: Create a Google Cloud Storage bucket

If you do not already have a Google Cloud Storage bucket, you can add one through the Google Cloud Platform Console or using the gsutil tool from the Cloud SDK.
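
For example, you can create a bucket with the gsutil mb command (my-bucket is a placeholder; bucket names must be globally unique):

gsutil mb gs://my-bucket/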

Step 2: Upload variants to Google Cloud Storage

Transfer the data

For this tutorial, uncompressed VCF files from the Illumina Platinum Genomes dataset are available. You can also import your own variant files. VCFs imported to Google Genomics must be stored uncompressed in Cloud Storage.

The following commands demonstrate a few different ways of loading data.

To transfer 3 samples from the Genomics public data:

gsutil cp \
    gs://genomics-public-data/platinum-genomes/vcf/NA1287*_S1.genome.vcf \
    gs://my-bucket/platinum-genomes/vcf/

To transfer a group of local files using a wildcard pattern:

gsutil -m -o 'GSUtil:parallel_composite_upload_threshold=150M' cp *.vcf \
    gs://my-bucket/platinum-genomes/vcf/

To transfer a local directory tree of files:

gsutil -m -o 'GSUtil:parallel_composite_upload_threshold=150M' cp -R \
    my-vcf-directory \
    gs://my-bucket/platinum-genomes/

If any failures occur due to temporary network issues, re-run with the no-clobber flag (-n) to transmit just the missing files:

gsutil -m -o 'GSUtil:parallel_composite_upload_threshold=150M' cp -n -R \
    my-vcf-directory \
    gs://my-bucket/platinum-genomes/

For more information on transferring data to Cloud Storage, see Using Google Cloud Storage with Big Data.

Check the data

If you copied the Platinum Genomes samples, the bucket should contain the following when the transfer completes:

gsutil ls gs://my-bucket/platinum-genomes/vcf
gs://my-bucket/platinum-genomes/vcf/NA12877_S1.genome.vcf
gs://my-bucket/platinum-genomes/vcf/NA12878_S1.genome.vcf
gs://my-bucket/platinum-genomes/vcf/NA12879_S1.genome.vcf

For more information on listing bucket contents, see the gsutil ls command.

Step 3: Import variants to Google Genomics

Create a Google Genomics dataset to hold your data

  • my-dataset-name: This can be any name you like, such as “My Copy of Platinum Genomes”.
gcloud alpha genomics datasets create --name my-dataset-name
Created dataset my-dataset-name, id: dataset-id

Note dataset-id, which you need in the next step.

For more detail, see managing datasets.
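
As an optional check, you can confirm the dataset exists by listing the datasets in your project:

gcloud alpha genomics datasets list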

Create a variantset

  • my-variantset-name: This can be any name you like, such as “My Copy of Platinum Genomes Variants”.
gcloud alpha genomics variantsets create \
  --dataset-id dataset-id \
  --name my-variantset-name
Created variant set id: variantset-id "my-variantset-name", belonging to dataset id: dataset-id

Note variantset-id, which you need in the next step.

For more detail, see managing variants.

Import your VCFs from Google Cloud Storage to your Google Genomics dataset

  • variantset-id: This was returned in the output of the prior command.
gcloud alpha genomics variants import \
  --variantset-id variantset-id \
  --source-uris gs://my-bucket/platinum-genomes/vcf/*.vcf
done: false
name: operation-id

Note operation-id, which you need in the next step.

If you have multiple variant files (multiple VCF or multiple masterVar files) to import into the same variant set, it is preferable to specify them all in the same import:

  • Importing the files together is faster than importing them one at a time.
  • Calls for a given sample will be given the same callset ID.

If variant files for a given sample are imported separately, each sample will be treated as a different callset and assigned a different callset ID.

Multiple files to import can be specified as a comma-delimited list of URIs for the source-uris flag. Wildcards are supported.
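
For example, to import two of the Platinum Genomes files explicitly, using the same placeholder variantset-id as above:

gcloud alpha genomics variants import \
  --variantset-id variantset-id \
  --source-uris gs://my-bucket/platinum-genomes/vcf/NA12877_S1.genome.vcf,gs://my-bucket/platinum-genomes/vcf/NA12878_S1.genome.vcf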

Check the import operation for completion

  • operation-id: This was returned in the output of the prior command.

Checking operation details

An operation can be examined with the command:

gcloud alpha genomics operations describe operation-id

A detailed description of the Operation resource can be found in the API documentation.

Polling operation for completion

To check whether an operation is completed, you can explicitly check the done status:

gcloud --format='default(error,done)' alpha genomics operations describe operation-id

To poll for completion of an operation, you can write a simple loop in bash:

OP_ID="operation-id"
# Poll every 30 seconds until the operation's done field becomes True.
while [[ $(gcloud --format='value(done)' alpha genomics operations describe "${OP_ID}") != True ]]; do
  echo "Sleeping for 30 seconds"
  sleep 30
done
# Print the final state of the operation, including any errors.
gcloud --format='default(name,error,done)' alpha genomics operations describe "${OP_ID}"

The loop will exit when the operation has completed and the subsequent command will output the operation including errors, if any.

Verify operation success

When an operation completes:

  • The done field will be changed from false to true.
  • If the operation has failed, the error field will be set.
  • If the operation has succeeded, some operation types will set the result field as described in the API documentation.

Step 4: Export variants to Google BigQuery

Do not export variants to Google BigQuery until the variant import has completed successfully.

Create a BigQuery dataset in the web UI to hold the data.

  1. Open the BigQuery web UI.
  2. Click the down arrow icon next to your project name in the navigation, then click Create new dataset.
  3. Input a dataset ID.
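
Alternatively, you can create the dataset from the command line with the bq tool included in the Cloud SDK (my-bigquery-dataset is a placeholder for your dataset ID):

bq mk my-bigquery-dataset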

Export variants to BigQuery

  • variantset-id: This was returned in the output of the variantsets create command in Step 3.
  • my-bigquery-dataset: This is the dataset ID you created in the prior step.
  • my-bigquery-table: This can be any ID you like, such as “platinum_genomes_variants”.
gcloud alpha genomics variantsets export \
  variantset-id \
  my-bigquery-table \
  --bigquery-dataset my-bigquery-dataset
done: false
name: operation-id

Note operation-id, which you need in the next step.

For more detail, see variant exports.

Check the export operation for completion

As described above for the import operation, be sure to check the export operation for completion before proceeding.
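
For a quick one-off check, you can reuse the same command from Step 3, substituting the export operation-id:

gcloud --format='default(error,done)' alpha genomics operations describe operation-id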

Use BigQuery browser tools

This example is based on the Platinum Genomes public data. You can modify the query to use your BigQuery dataset.

  1. Open the BigQuery web UI.
  2. Click on "Compose Query".
  3. Copy and paste the following query into the dialog box and click on "Run Query".
#standardSQL
-- Count the number of records (variant and reference segments) we have in
-- the dataset and the total number of calls nested within those records.
--
-- The source data for table genomics-public-data:platinum_genomes.variants
-- was gVCF so a record can be a particular variant or a non-variant segment.
-- https://sites.google.com/site/gvcftools/home/about-gvcf
--
SELECT
  reference_name,
  COUNT(reference_name) AS num_records,
  SUM(ARRAY_LENGTH(call)) AS num_calls
FROM
  `genomics-public-data.platinum_genomes.variants` v
GROUP BY
  reference_name
ORDER BY
  reference_name
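
You can also run the same query from the command line with the bq tool (a sketch; the SQL is identical to the query above):

bq query --use_legacy_sql=false '
SELECT
  reference_name,
  COUNT(reference_name) AS num_records,
  SUM(ARRAY_LENGTH(call)) AS num_calls
FROM
  `genomics-public-data.platinum_genomes.variants`
GROUP BY reference_name
ORDER BY reference_name'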

Now you can learn more about the variants table record structure, start querying your variants, or review more examples.