Loading Genomic Variants

Google Cloud provides a way to store and work with genomic variants. This tutorial walks you through the first steps:

  1. Store raw VCF files in Google Cloud Storage
  2. Load variants to Google BigQuery for large-scale analyses.

Before you begin

  1. Complete the Quickstart.
  2. To get the command-line client, install the Cloud SDK and the Genomics commands (see the sketch after this list).
  3. Enable billing for your project:
    1. Open the billing page for the project you have selected or created.
    2. Click Enable billing.

    Note: Enabling billing does not necessarily mean you will be charged. See Pricing for more information.
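
One way to get the Genomics commands is to install the gcloud alpha component, since this tutorial uses gcloud alpha genomics. This is a minimal sketch; component availability can vary by Cloud SDK version:

gcloud components update
gcloud components install alpha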

Step 1: Create a Google Cloud Storage bucket

If you do not already have a Google Cloud Storage bucket, you can add one through the Google Cloud Platform Console or using the gsutil tool from the Cloud SDK.
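
For example, a minimal sketch using gsutil, where my-bucket is a placeholder for a bucket name you choose (bucket names must be globally unique):

gsutil mb gs://my-bucket/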

Step 2: Upload variants to Google Cloud Storage

Transfer the data

For this tutorial, some uncompressed VCF files from the Illumina Platinum Genomes dataset are publicly available. You can also import your own variant files. VCFs for import to Google Genomics must be stored uncompressed in Cloud Storage.
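
If your local variant files are gzip-compressed, decompress them before uploading. A minimal sketch, assuming files named like sample.vcf.gz in the current directory:

gunzip *.vcf.gz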

The following commands demonstrate a few different ways of loading data.

To transfer 3 samples from the genomics-public-data public bucket:

gsutil cp \
    gs://genomics-public-data/platinum-genomes/vcf/NA1287*_S1.genome.vcf \
    gs://my-bucket/platinum-genomes/vcf/

To transfer a group of local files using a grouping pattern:

gsutil -m -o 'GSUtil:parallel_composite_upload_threshold=150M' cp *.vcf \
    gs://my-bucket/platinum-genomes/vcf/

To transfer a local directory tree of files:

gsutil -m -o 'GSUtil:parallel_composite_upload_threshold=150M' cp -R \
    my-vcf-directory \
    gs://my-bucket/platinum-genomes/vcf/

If any failures occur due to temporary network issues, re-run with the no-clobber (-n) flag to transmit just the missing files:

gsutil -m -o 'GSUtil:parallel_composite_upload_threshold=150M' cp -n -R \
    my-vcf-directory \
    gs://my-bucket/platinum-genomes/vcf/

For more information on transferring data to Cloud Storage, see Using Google Cloud Storage with Big Data.

Check the data

If you have copied the Platinum Genomes samples, then when you are done, you can verify that the bucket contains them by listing its contents:

gsutil ls gs://my-bucket/platinum-genomes/vcf

For more information on listing bucket contents, see the gsutil ls command.

Step 3: Load variants to Google BigQuery

Create a BigQuery dataset in the web UI to hold the data

  1. Open the BigQuery web UI.
  2. Click the down arrow icon next to your project name in the navigation, then click Create new dataset.
  3. Enter a dataset ID, such as my-bigquery-dataset.
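
If you prefer the command line, the bq tool from the Cloud SDK can also create the dataset. A minimal sketch, using the same placeholder names as the rest of this tutorial:

bq mk --dataset my-project:my-bigquery-dataset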

Launch the VCF to BigQuery pipeline

The Variant Transforms tool can be used to load VCF files directly to BigQuery. We'll use the Genomics Pipelines API to launch the pipeline. First, set up the pipeline configuration shown below and save it as vcf_to_bigquery.yaml, substituting the following values:

  • my-project: This is the name of your project that contains the BigQuery dataset.
  • my-bigquery-dataset: This is the dataset ID you created in the prior step.
  • my-bigquery-table: This can be any ID you like, such as "platinum_genomes_variants".

name: vcf-to-bigquery-pipeline
docker:
  imageName: gcr.io/gcp-variant-transforms/gcp-variant-transforms
  cmd: |
    ./opt/gcp_variant_transforms/bin/vcf_to_bq \
      --project my-project \
      --input_pattern gs://my-bucket/platinum-genomes/vcf/*.vcf \
      --output_table my-project:my-bigquery-dataset.my-bigquery-table \
      --staging_location gs://my-bucket/staging \
      --temp_location gs://my-bucket/temp \
      --job_name vcf-to-bigquery \
      --variant_merge_strategy MOVE_TO_CALLS \
      --runner DataflowRunner

Next, run the following command to launch the pipeline:

gcloud alpha genomics pipelines run \
    --project my-project \
    --pipeline-file vcf_to_bigquery.yaml \
    --logging gs://my-bucket/temp/runner_logs \
    --zones us-west1-b \
    --service-account-scopes https://www.googleapis.com/auth/bigquery

The output is:

Running [operations/operation-id].

Note the operation-id, which you will need in the next step.
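
If you plan to script the status checks below, it can help to capture the ID in a shell variable. A minimal sketch, where operation-id is the value returned above:

OP_ID=operation-id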

Check the operation for completion

In the commands that follow, substitute the following value:

  • operation-id: This was returned in the output of the prior command.

Checking operation details

An operation can be examined with the command:

gcloud alpha genomics operations describe operation-id

A detailed description of the Operation resource can be found in the API documentation.

The underlying pipeline uses Cloud Dataflow. You can navigate to the Dataflow Console to see a more detailed view of the pipeline (e.g. the number of records being processed, the number of workers, and more detailed error logs).
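
You can also list Dataflow jobs from the command line; the pipeline should appear under the name passed with --job_name. A sketch, assuming the beta Dataflow commands are installed in your Cloud SDK:

gcloud beta dataflow jobs list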

Polling operation for completion

To check whether an operation is completed, you can explicitly check the done status:

gcloud --format='default(error,done)' alpha genomics operations describe operation-id

To poll for completion of an operation, you can write a simple loop in bash:

while [[ $(gcloud --format='value(done)' alpha genomics operations describe "${OP_ID}") != True ]]; do
  echo "Sleeping for 30 seconds"
  sleep 30
done
gcloud --format='default(name,error,done)' alpha genomics operations describe "${OP_ID}"

The loop exits when the operation has completed, and the subsequent command outputs the operation details, including errors, if any.

Verify operation success

When an operation completes:

  • The done field will be changed from false to true.
  • If the operation has failed, the error field will be set.
  • If the operation has succeeded, some operation types will set the result field as described in the API documentation.

Use BigQuery browser tools

This example is based on the Platinum Genomes public data. You can modify the query to use your BigQuery dataset.

  1. Open the BigQuery web UI.
  2. Click on "Compose Query".
  3. Copy and paste the following query into the dialog box and click on "Run Query".
#standardSQL
-- Count the number of records (variant and reference segments) we have in
-- the dataset and the total number of calls nested within those records.
-- The source data for table genomics-public-data:platinum_genomes.variants
-- was gVCF so a record can be a particular variant or a non-variant segment.
-- https://sites.google.com/site/gvcftools/home/about-gvcf
SELECT
  COUNT(reference_name) AS num_records,
  SUM(ARRAY_LENGTH(call)) AS num_calls
FROM
  `genomics-public-data.platinum_genomes.variants` v
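
If you prefer the command line, the same query can be run with the bq tool. A minimal sketch:

bq query --use_legacy_sql=false '
SELECT
  COUNT(reference_name) AS num_records,
  SUM(ARRAY_LENGTH(call)) AS num_calls
FROM
  `genomics-public-data.platinum_genomes.variants` v'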

Now you can learn more about the variants table record structure, start querying your variants, or review more examples.
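
As one way to start exploring the record structure, you can list the call set (sample) names nested within the records. A sketch against the same public table, assuming the call.name field from the variants export schema; substitute your own dataset and table as needed:

bq query --use_legacy_sql=false '
SELECT DISTINCT c.name AS call_set_name
FROM
  `genomics-public-data.platinum_genomes.variants` v, UNNEST(v.call) AS c
ORDER BY call_set_name'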
