Copying data into Cloud Storage
Cloud Genomics hosts a public dataset containing data from Illumina Platinum Genomes. To copy two VCF files from the dataset to your bucket:
gsutil cp \
    gs://genomics-public-data/platinum-genomes/vcf/NA1287*_S1.genome.vcf \
    gs://BUCKET/vcf/
Copying variants from a local file system
To copy a group of local files:
gsutil -m -o 'GSUtil:parallel_composite_upload_threshold=150M' cp *.vcf \
    gs://BUCKET/vcf/
To copy a local directory of files:
gsutil -m -o 'GSUtil:parallel_composite_upload_threshold=150M' cp -R \
    VCF_FILE_DIRECTORY/ \
    gs://BUCKET/vcf/
If any failures occur due to temporary network issues, you can re-run the previous commands using the no-clobber (-n) flag, which copies only the files that are missing from the destination:
gsutil -m -o 'GSUtil:parallel_composite_upload_threshold=150M' cp -n -R \
    VCF_FILE_DIRECTORY/ \
    gs://BUCKET/vcf/
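After the copies complete, you can spot-check that the files landed where you expect. This is a minimal check, assuming the same BUCKET placeholder and vcf/ prefix used above:
gsutil ls gs://BUCKET/vcf/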
For more information on copying data to Cloud Storage, see Using Cloud Storage with Big Data.
Loading and transforming VCF files into BigQuery
Using the pipeline, you can transform and load hundreds of thousands of files, millions of samples, and billions of records in a scalable manner.
Before you begin
To run the pipeline, you need:
A GCP project with billing and the Cloud Genomics, Compute Engine, Cloud Storage, and Cloud Dataflow APIs enabled (see the example command after this list).
An existing VCF file or a GZIP or BZIP file containing VCF files in a Cloud Storage bucket. If you don't have a VCF file, you can find one in the available public datasets.
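If any of these APIs are not yet enabled, you can turn them on from the command line. The following is a sketch only; it assumes the Cloud SDK is installed, your project is set as the default, and uses the standard service identifiers for these APIs:
gcloud services enable \
    genomics.googleapis.com \
    compute.googleapis.com \
    storage-component.googleapis.com \
    dataflow.googleapis.com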
Running the pipeline
You can run the pipeline using a Docker image that has all of the necessary binaries and dependencies installed.
To run the pipeline using a Docker image, complete the following steps:
Copy the following text and save it to a file named vcf_to_bigquery.yaml. Substitute the variables with the relevant resources from your GCP project.
name: vcf-to-bigquery-pipeline
docker:
  imageName: gcr.io/gcp-variant-transforms/gcp-variant-transforms
  cmd: |
    ./opt/gcp_variant_transforms/bin/vcf_to_bq \
      --project PROJECT_ID \
      --input_pattern gs://BUCKET/*.vcf \
      --output_table PROJECT_ID:BIGQUERY_DATASET.BIGQUERY_TABLE \
      --staging_location gs://BUCKET/staging \
      --temp_location gs://BUCKET/temp \
      --job_name vcf-to-bigquery \
      --runner DataflowRunner
When specifying the location of your VCF files in a Cloud Storage bucket, you can specify a single file or use a wildcard (*) to load multiple files at once. Acceptable file formats include GZIP, BZIP, and VCF.
Keep in mind that the pipeline runs more slowly for compressed files because compressed files cannot be sharded. If you want to merge samples across files, see the Variant Merging documentation.
Note that the gs://BUCKET/staging and gs://BUCKET/temp directories are used to store temporary files needed to run the pipeline.
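The pipeline writes to the BigQuery table named in --output_table. If the dataset that table belongs to does not already exist, you may need to create it before starting the job; a minimal sketch using the same placeholders as above:
bq mk --dataset PROJECT_ID:BIGQUERY_DATASET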
Run the following command to start the pipeline:
gcloud alpha genomics pipelines run \
    --project PROJECT_ID \
    --pipeline-file vcf_to_bigquery.yaml \
    --logging gs://BUCKET/temp/runner_logs \
    --zones us-west1-b \
    --service-account-scopes https://www.googleapis.com/auth/bigquery
The command returns an operation ID in the format Running [operations/OPERATION_ID]. You can use the operation ID to track the status of the pipeline by running the following command:
gcloud alpha genomics operations describe OPERATION_ID
The operations describe command returns done: true when the pipeline finishes. Depending on several factors, such as the size of your data, it can take anywhere from several minutes to an hour or more for the job to complete.
You can run the following simple bash loop to check every 30 seconds whether the job is running, has finished, or returned an error:
while [[ $(gcloud --format='value(done)' alpha genomics operations describe OPERATION_ID) != True ]]; do
    echo "Job still running, sleeping for 30 seconds..."
    sleep 30
done
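The loop above only waits for done to become true; it does not distinguish success from failure. After it exits, you can check whether the operation recorded an error. This is a sketch, assuming the error field of the operation is populated only when the pipeline failed:
gcloud --format='yaml(error)' alpha genomics operations describe OPERATION_ID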
Because the pipeline uses Cloud Dataflow, you can navigate to the Cloud Dataflow Console to see a detailed view of the job. For example, you can view the number of records processed, the number of workers, and detailed error logs.
After the job completes, run the following command to list all of the tables in your dataset. Check that the new table containing your VCF data is in the list:
bq ls --format=pretty PROJECT_ID:BIGQUERY_DATASET
You can also view details about the table, such as the schema and when it was last modified:
bq show --format=pretty PROJECT_ID:BIGQUERY_DATASET.BIGQUERY_TABLE
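As a quick sanity check on the load itself, you can count the records in the new table. A minimal sketch using standard SQL, with the same placeholders as above:
bq query --use_legacy_sql=false \
    'SELECT COUNT(1) AS record_count FROM `PROJECT_ID.BIGQUERY_DATASET.BIGQUERY_TABLE`'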
- Read through Analyzing variants using BigQuery to analyze the data you've loaded into BigQuery.
- Learn about the BigQuery variants schema.
- For information on the Operations resource and the information returned from the gcloud alpha genomics operations describe command, see the API documentation.
- Learn how to perform uploads in parallel for large local files using a parallel composite pattern.