Storing and Loading Genomic Variants

This page describes how to copy and store raw VCF files in Cloud Storage and load variants into BigQuery for large-scale analysis.

Copying data into Cloud Storage

Cloud Genomics hosts a public dataset containing data from Illumina Platinum Genomes. To copy two VCF files from the dataset to your bucket, run the following command, replacing BUCKET with the name of your Cloud Storage bucket:

gsutil cp \
    gs://genomics-public-data/platinum-genomes/vcf/NA1287*_S1.genome.vcf \
    gs://BUCKET/platinum-genomes/vcf/
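
If you want to confirm the copy, listing the destination path should show the two VCF files matched by the wildcard:

gsutil ls gs://BUCKET/platinum-genomes/vcf/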

Copying variants from a local file system

To copy a group of local files:

gsutil -m -o 'GSUtil:parallel_composite_upload_threshold=150M' cp *.vcf \
    gs://BUCKET/vcf/

To copy a local directory of files:

gsutil -m -o 'GSUtil:parallel_composite_upload_threshold=150M' cp -R \
    VCF_FILE_DIRECTORY/ \
    gs://BUCKET/vcf/

If any failures occur due to temporary network issues, you can re-run the previous commands using the no-clobber (-n) flag, which copies only the missing files:

gsutil -m -o 'GSUtil:parallel_composite_upload_threshold=150M' cp -n -R \
    VCF_FILE_DIRECTORY \
    gs://BUCKET/vcf/
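
If you want to confirm that every local file was uploaded, a simple check (a sketch using only standard shell and gsutil commands) is to compare the local and remote VCF counts:

# Count local VCF files
find VCF_FILE_DIRECTORY -name '*.vcf' | wc -l
# Count VCF objects under the destination prefix
gsutil ls -r gs://BUCKET/vcf/ | grep -c '\.vcf$'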

For more information on copying data to Cloud Storage, see Using Cloud Storage with Big Data.

Loading and transforming VCF files into BigQuery

You can use the variant transforms pipeline to transform and load VCF files directly into BigQuery.

Using the pipeline, you can transform and load hundreds of thousands of files, millions of samples, and billions of records in a scalable manner.

The pipeline is based on Apache Beam and uses Cloud Dataflow.
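
If you want to try the transform on a small file before launching a full Cloud Dataflow job, the same vcf_to_bq entry point can also be run with Beam's local DirectRunner. The sketch below is not part of the documented steps: it assumes the pipeline's Python package is installed in your local environment and that you have application-default credentials with access to Cloud Storage and BigQuery. The module path is inferred from the binary path used later on this page, and the flags mirror the Dataflow invocation in the steps that follow.

# Assumed local invocation; the module path and the small-test.vcf input are placeholders.
python -m gcp_variant_transforms.vcf_to_bq \
    --project GOOGLE_CLOUD_PROJECT \
    --input_pattern gs://BUCKET/small-test.vcf \
    --output_table GOOGLE_CLOUD_PROJECT:BIGQUERY_DATASET.BIGQUERY_TABLE \
    --temp_location gs://BUCKET/temp \
    --runner DirectRunner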

Before you begin

To run the pipeline, you need a GCP project, a Cloud Storage bucket containing your VCF files, and a BigQuery dataset in which the output table will be created.

Running the pipeline

You can run the pipeline using a Docker image that has all of the necessary binaries and dependencies installed.

To run the pipeline using a Docker image, complete the following steps:

  1. Run the following command to start the pipeline. Substitute the variables with the relevant resources from your GCP project.

    GOOGLE_CLOUD_PROJECT=GOOGLE_CLOUD_PROJECT
    INPUT_PATTERN=gs://BUCKET/*.vcf
    OUTPUT_TABLE=GOOGLE_CLOUD_PROJECT:BIGQUERY_DATASET.BIGQUERY_TABLE
    TEMP_LOCATION=gs://BUCKET/temp
    COMMAND="/opt/gcp_variant_transforms/bin/vcf_to_bq \
        --project ${GOOGLE_CLOUD_PROJECT} \
        --input_pattern ${INPUT_PATTERN} \
        --output_table ${OUTPUT_TABLE} \
        --temp_location ${TEMP_LOCATION} \
        --job_name vcf-to-bigquery \
        --runner DataflowRunner"
    gcloud alpha genomics pipelines run \
        --project "${GOOGLE_CLOUD_PROJECT}" \
        --logging "${TEMP_LOCATION}/runner_logs_$(date +%Y%m%d_%H%M%S).log" \
        --zones us-west1-b \
        --service-account-scopes https://www.googleapis.com/auth/cloud-platform \
        --docker-image gcr.io/gcp-variant-transforms/gcp-variant-transforms \
        --command-line "${COMMAND}"
    

    When specifying the location of your VCF files in a Cloud Storage bucket, you can specify a single file or use a wildcard (*) to load multiple files at once. The pipeline accepts uncompressed VCF files as well as VCF files compressed with gzip or bzip2.

    Keep in mind that the pipeline runs more slowly for compressed files because compressed files cannot be sharded. If you want to merge samples across files, see the Variant Merging documentation.

    Note that the gs://BUCKET/temp directory (passed as TEMP_LOCATION) is used to store temporary files needed to run the pipeline.

  2. The command returns an operation ID in the format Running [OPERATION_ID]. You can use the operation ID to track the status of the pipeline by running the following command:

    gcloud alpha genomics operations describe OPERATION_ID
    

  3. The operations describe command returns done: true when the pipeline finishes. Depending on several factors, such as the size of your data, it can take anywhere from several minutes to an hour or more for the job to complete.

    You can run the following simple bash loop to check every 30 seconds whether the job is running, has finished, or returned an error:

    while [[ $(gcloud --format='value(done)' alpha genomics operations describe OPERATION_ID) != True ]]; do
        echo "Job still running, sleeping for 30 seconds..."
        sleep 30
    done
    

    Because the pipeline uses Cloud Dataflow, you can navigate to the Cloud Dataflow Console to see a detailed view of the job. For example, you can view the number of records processed, the number of workers, and detailed error logs.

  4. After the job completes, run the following command to list all of the tables in your dataset. Check that the new table containing your VCF data is in the list:

    bq ls --format=pretty GOOGLE_CLOUD_PROJECT:BIGQUERY_DATASET
    

    You can also view details about the table, such as the schema and when it was last modified:

    bq show --format=pretty GOOGLE_CLOUD_PROJECT:BIGQUERY_DATASET.BIGQUERY_TABLE
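
    You can also run a quick query against the new table. The following sketch counts loaded variants per reference; the reference_name column assumes the pipeline's default BigQuery schema, so adjust it to match the schema that bq show reports:

    bq query --use_legacy_sql=false \
        'SELECT reference_name, COUNT(1) AS variant_count
         FROM `GOOGLE_CLOUD_PROJECT.BIGQUERY_DATASET.BIGQUERY_TABLE`
         GROUP BY reference_name
         ORDER BY variant_count DESC'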
    
