Store raw VCF files in Cloud Storage

This page describes how to copy and store raw VCF files in Cloud Storage. After storing raw VCF files, you can use the Variant Transforms tool to load them into BigQuery.

Copy data into Cloud Storage

Cloud Life Sciences hosts a public dataset containing data from the Illumina Platinum Genomes project. To copy two VCF files from the dataset to your bucket, run the gsutil cp command:

gsutil cp \
    gs://genomics-public-data/platinum-genomes/vcf/NA1287*_S1.genome.vcf \
    gs://BUCKET/platinum-genomes/vcf/

Replace BUCKET with the name of your Cloud Storage bucket.
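
To confirm that both files arrived, you can list the destination path. This is a quick check, assuming the same BUCKET placeholder:

gsutil ls gs://BUCKET/platinum-genomes/vcf/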

Copy variants from a local file system

To copy a group of VCF files from your current working directory, run the gsutil cp command:

gsutil -m -o 'GSUtil:parallel_composite_upload_threshold=150M' cp *.vcf \
    gs://BUCKET/vcf/

Replace BUCKET with the name of your Cloud Storage bucket.
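
The -o flag raises the parallel composite upload threshold for this command only, so files larger than 150 MB are uploaded in parallel chunks. If you upload large files regularly, you can instead persist the setting in your .boto configuration file; a minimal sketch:

[GSUtil]
parallel_composite_upload_threshold = 150M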

To copy a local directory of VCF files, run the following command. The -R flag copies the directory recursively:

gsutil -m -o 'GSUtil:parallel_composite_upload_threshold=150M' cp -R \
    VCF_FILE_DIRECTORY/ \
    gs://BUCKET/vcf/

Replace the following:

  • VCF_FILE_DIRECTORY: the path to the local directory containing VCF files
  • BUCKET: the name of your Cloud Storage bucket
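
To spot-check the copy, you can compare the local VCF file count with the object count under the destination path; a rough sketch, assuming the directory contains only VCF files:

find VCF_FILE_DIRECTORY/ -name '*.vcf' | wc -l
gsutil ls 'gs://BUCKET/vcf/**' | wc -l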

If any copies fail because of temporary network issues, re-run the previous commands with the no-clobber (-n) flag, which skips files that already exist at the destination and copies only the missing ones:

gsutil -m -o 'GSUtil:parallel_composite_upload_threshold=150M' cp -n -R \
    VCF_FILE_DIRECTORY/ \
    gs://BUCKET/vcf/

Replace the following:

  • VCF_FILE_DIRECTORY: the path to the local directory containing VCF files
  • BUCKET: the name of your Cloud Storage bucket
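
Alternatively, gsutil rsync makes the destination match the source and copies only files that are missing or changed, which can be simpler for repeated transfers; a sketch using the same placeholders:

gsutil -m -o 'GSUtil:parallel_composite_upload_threshold=150M' rsync -r \
    VCF_FILE_DIRECTORY/ \
    gs://BUCKET/vcf/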

For more information on copying data to Cloud Storage, see Using Cloud Storage with Big Data.
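
After the files are in Cloud Storage, you can load them into BigQuery with the Variant Transforms tool mentioned above. The following is a minimal sketch of the docker-based invocation; the gcr.io/cloud-lifesciences/gcp-variant-transforms image name and the PROJECT, REGION, BIGQUERY_DATASET, and BIGQUERY_TABLE placeholders are assumptions to verify against the Variant Transforms documentation:

COMMAND="vcf_to_bq \
    --input_pattern gs://BUCKET/vcf/*.vcf \
    --output_table PROJECT:BIGQUERY_DATASET.BIGQUERY_TABLE \
    --job_name vcf-to-bigquery \
    --runner DataflowRunner"

docker run -v ~/.config:/root/.config \
    gcr.io/cloud-lifesciences/gcp-variant-transforms \
    --project PROJECT \
    --region REGION \
    --temp_location gs://BUCKET/temp \
    "${COMMAND}"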