Storing and Loading Genomic Variants

This page describes how to copy and store raw VCF files in Cloud Storage and load variants into BigQuery for large-scale analysis.

Copying data into Cloud Storage

Cloud Genomics hosts a public dataset containing data from Illumina Platinum Genomes. To copy two VCF files from the dataset to your bucket:

gsutil cp \
    gs://genomics-public-data/platinum-genomes/vcf/NA1287*_S1.genome.vcf \
    gs://BUCKET/platinum-genomes/vcf/
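
Optionally, you can confirm that the files were copied by listing the destination path (BUCKET is your bucket name, as above):

gsutil ls gs://BUCKET/platinum-genomes/vcf/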

Copying variants from a local file system

To copy a group of local files:

gsutil -m -o 'GSUtil:parallel_composite_upload_threshold=150M' cp *.vcf \
    gs://BUCKET/vcf/

To copy a local directory of files:

gsutil -m -o 'GSUtil:parallel_composite_upload_threshold=150M' cp -R \
    VCF_FILE_DIRECTORY/ \
    gs://BUCKET/vcf/

If any failures occur due to temporary network issues, you can re-run the previous commands using the no-clobber (-n) flag, which copies only the missing files:

gsutil -m -o 'GSUtil:parallel_composite_upload_threshold=150M' cp -n -R \
    VCF_FILE_DIRECTORY \
    gs://BUCKET/vcf/
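
If you want to confirm that every file was copied, one option is to compare the number of local VCF files with the number of VCF objects in the bucket. This is an optional sketch that assumes your local files use the .vcf extension:

# Count the local VCF files.
find VCF_FILE_DIRECTORY -name '*.vcf' | wc -l

# Count the VCF objects copied to the bucket.
gsutil ls -r gs://BUCKET/vcf/ | grep -c '\.vcf$'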

For more information on copying data to Cloud Storage, see Using Cloud Storage with Big Data.

Loading and transforming VCF files into BigQuery

You can use the variant transforms pipeline to transform and load VCF files directly into BigQuery.

Using the pipeline, you can transform and load hundreds of thousands of files, millions of samples, and billions of records in a scalable manner.

The pipeline is based on Apache Beam and uses Cloud Dataflow.
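
Because the job runs on Cloud Dataflow, you can also monitor it from the command line once it has started. For example, the following optional command lists the Dataflow jobs in your project (PROJECT_ID is the same placeholder used in the steps below):

gcloud dataflow jobs list --project PROJECT_ID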

Before you begin

To run the pipeline, you need:

  * A GCP project with billing enabled.
  * A Cloud Storage bucket containing the VCF files you want to load.
  * A BigQuery dataset in which the pipeline can create the output table.
  * The Compute Engine, Cloud Storage, Cloud Dataflow, Cloud Genomics, and BigQuery APIs enabled for your project.
  * The gcloud and bq command-line tools, which are included in the Cloud SDK.
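
If some of these APIs are not yet enabled, one way to enable them is with the gcloud services enable command. The following is an illustrative sketch; the service names are the standard API identifiers, and PROJECT_ID is the placeholder used throughout this page:

gcloud services enable \
    compute.googleapis.com \
    storage-component.googleapis.com \
    dataflow.googleapis.com \
    genomics.googleapis.com \
    bigquery.googleapis.com \
    --project PROJECT_ID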

Running the pipeline

You can run the pipeline using a Docker image that has all of the necessary binaries and dependencies installed.

To run the pipeline using a Docker image, complete the following steps:

  1. Copy the following text and save it to a file named vcf_to_bigquery.yaml. Replace the variables with the relevant resources from your GCP project.

    name: vcf-to-bigquery-pipeline
    docker:
      imageName: gcr.io/gcp-variant-transforms/gcp-variant-transforms
      cmd: |
        /opt/gcp_variant_transforms/bin/vcf_to_bq \
          --project PROJECT_ID \
          --input_pattern gs://BUCKET/*.vcf \
          --output_table PROJECT_ID:BIGQUERY_DATASET.BIGQUERY_TABLE \
          --staging_location gs://BUCKET/staging \
          --temp_location gs://BUCKET/temp \
          --job_name vcf-to-bigquery \
          --runner DataflowRunner
    

    When specifying the location of your VCF files in a Cloud Storage bucket, you can point to a single file or use a wildcard (*) to load multiple files at once. You can load uncompressed VCF files as well as VCF files compressed with GZIP or BZIP.

    Keep in mind that the pipeline runs more slowly for compressed files because compressed files cannot be sharded. If you want to merge samples across files, see the Variant Merging documentation.

    Note that the gs://BUCKET/staging and gs://BUCKET/temp directories are used to store temporary files needed to run the pipeline.

  2. Run the following command to start the pipeline:

    gcloud alpha genomics pipelines run \
        --project PROJECT_ID \
        --pipeline-file vcf_to_bigquery.yaml \
        --logging gs://BUCKET/temp/runner_logs \
        --zones us-west1-b \
        --service-account-scopes https://www.googleapis.com/auth/bigquery
    

  3. The command returns an operation ID in the format Running [operations/OPERATION_ID]. You can use the operation ID to track the status of the pipeline by running the following command:

    gcloud alpha genomics operations describe OPERATION_ID
    

  4. The operations describe command returns done: true when the pipeline finishes. Depending on several factors, such as the size of your data, it can take anywhere from several minutes to an hour or more for the job to complete.

    You can run the following simple bash loop to check every 30 seconds whether the job is running, has finished, or returned an error:

    while [[ $(gcloud --format='value(done)' alpha genomics operations describe OPERATION_ID) != True ]]; do
        echo "Job still running, sleeping for 30 seconds..."
        sleep 30
    done
    

    Because the pipeline uses Cloud Dataflow, you can navigate to the Cloud Dataflow Console to see a detailed view of the job. For example, you can view the number of records processed, the number of workers, and detailed error logs.

  5. After the job completes, run the following command to list all of the tables in your dataset. Check that the new table containing your VCF data is in the list:

    bq ls --format=pretty PROJECT_ID:BIGQUERY_DATASET
    

    You can also view details about the table, such as the schema and when it was last modified (for a quick sanity check of the loaded data, see the sample query after these steps):

    bq show --format=pretty PROJECT_ID:BIGQUERY_DATASET.BIGQUERY_TABLE
    

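As a final sanity check, you can run a simple query against the new table. The following sketch only counts rows, so it does not depend on the details of the generated schema; PROJECT_ID, BIGQUERY_DATASET, and BIGQUERY_TABLE are the same placeholders used in the steps above:

bq query --use_legacy_sql=false \
    'SELECT COUNT(*) AS record_count
     FROM `PROJECT_ID.BIGQUERY_DATASET.BIGQUERY_TABLE`'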