Google Cloud Platform
Genomic ancestry inference with deep learning
For the past several years, our goal at Google has been to play a critical role in bringing the benefits of AI to everyone. Machine learning is at the heart of that goal. In the area of life sciences — or more specifically, the field of genomics — we’re using ML to derive insights from the human genome. Additionally, due to the scale of human genomic data, we need new techniques to process the datasets using machine learning and cloud computing.
In this blog post, we’ll delve into ancestry prediction, one use case for genomics data. We’ll examine an analytical technique for verifying the ancestry of a human DNA sample, and show you how to implement it as a system using Google Cloud Platform, TensorFlow and Google Cloud Machine Learning Engine.
Why is ancestry inference important?A common problem when analyzing genomic data is ensuring the integrity and quality of samples. Although quality control metrics exist to ensure sufficient DNA has been collected and is not degraded, sample labeling can still be a major source of error. Ancestry inference is one tool in a collection of quality control tests used to verify whether sample swaps or data contamination have occurred. Applying inference as a technique is sometimes used as a checksum to verify a sample’s integrity. For case-control studies, ancestry inference is also used to understand population stratification across the cohort to help avoid spurious associations that might appear with even subtle ancestry differences between cases and controls.In this blog post, we’ll show you how to build a model for ancestry inference using the 1000 Genomes Project's labels, the most detailed public catalog of human genetic variation. We’ll then walk through how we built an end-to-end machine learning pipeline architecture. In the process, we’ll demonstrate the reference architecture of such a system and why we think Google Cloud Platform is a powerful way to deliver that architecture.
1000 Genomes datasetThe 1000 Genomes dataset has already been processed and uploaded to BigQuery table named 1000 Genomes phase 3. In this section, we’ll describe the processed dataset, and in a later section we’ll process the raw Variant Call Format (VCF) data using Google Genomics API, our implementation of the open standard from Global Alliance for Genomics and Health. The variants data in BigQuery has the following statistics:
|Table Size||10.5 TB|
|Long Term Storage Size||10.5 TB|
|Number of Rows||84,801,867|
|Creation Time||Jan 8, 2016, 6:53:36 PM|
|Last Modified||Jan 15, 2016, 1:35:26 PM|
Figure 1: 1000 Genomes phase 3 variants statistics
Evaluation datasetOnce the model has been trained using 1000 Genomes, we’ll evaluate the model performance using the following data set:
Figure 2: 1000 Genomes labels
COUNT(Super_Population) AS cnt
ORDER BY cnt DESC
Figure 3: SQL BigQuery to get information about ancestry labels
Model building approachIn order to infer the ancestry of a given sample, we first need to train a machine learning model using an algorithm. We used TensorFlow — an open source library for numerical computation using data flow graphs — to build and define the model, and the code to build this model can be found on GitHub. We treat the variants on each chromosome as a bag of words, where each word corresponds to a variant, defined as a tuple with five elements as (reference_name,start,end,reference_bases,alternate_bases) the set of all variants present on the dataset for a given chromosome. This defines a sparse space in multiple dimensions where the feature vector for each subject's variants per chromosome lives. Training a model on sparse vectors presents a challenge because we won’t have enough data to train. To solve this problem, we’ll convert the input vector into a lower dimensional dense embedding, resulting in an output which has non-zero values in the lower dimensional representation. We feed this output into a feed-forward neural network with a single hidden relu layer and a fully connected output layer. To train this model, we need to optimize a loss function, which is computed using cross entropy by applying Momentum Optimizer. In the figure below you can see the neural network architecture currently employed.
Figure 4: Neural network model architecture
Reference architectureThe end-to-end architecture uses Google Cloud Platform services and makes it easy to manage a very complex workflow.
Figure 5: End-to-end processing using Google Cloud Platform services
Transform data into actionsIn this section we’ll take you through the steps needed to deliver the machine learning model. You can replicate these steps and achieve the same end goal for your problem.
Data ingestionData ingestion involves taking your raw dataset and ingesting into Google Cloud Platform. You can do this in a few different ways using Google Cloud Bigtable, Google Cloud Storage or Google Drive, but for the purposes of this blog post, we’ll show you how to ingest raw VCF data into Google Cloud Storage. You can access raw VCF data from the 1000 Genomes FTP and copy the files to Google Cloud Storage using gsutil. Alternatively, you could use the Cloud Storage Transfer Service for different ingestion options. To simplify this walkthrough, we’ve already copied the raw VCF data to Google Cloud Storage.
Data preparationNext, we take the raw VCF dataset in Google Cloud Storage and process it through Google Genomics API. We’ll be performing two steps in this phase. The first will be to import the variants into Google Genomics, and the second will be to export the variants to BigQuery. The steps you’ll need to take to achieve this can be found in the Loading Genomic Variants guide. The output of this phase will be the variants dataset in BigQuery, which we have already pre-computed.
Pre-computed data and model — Requires quota increaseIn order to perform the data pre-processing and model training in the next few steps, you will need a quota increase. To save you compute and storage costs, we’ve already processed the data for you for this tutorial. The processed datasets and model locations are listed in the table below.
Important Note: The model size exceeds the 250MB limit placed by the prediction service. If you decide to send requests from your project, make sure to request a quota increase up to 0.5GB. To submit your model size request, click on the quota link, select any of the quota edit (pencil logo) links, apply for higher quota, and submit your form request.
|Dataset/Model||Cloud Storage / BigQuery Location|
|1000 Genomes processed||gs://human-genome/1000-genomes/20170819-142122|
|Cloud ML Engine model||gs://human-genome/models/1000-genomes|
|SGDP batch predictions||gs://human-genome/predictions/sgdp|
|SGDP BigQuery predictions||https://bigquery.cloud.google.com/table/sgdp_predictions|
Table 1: Google Cloud Storage locations for processed data and model
Figure 6: Dataflow DAG visualizer on Google Cloud which shows the Pipeline
Storage of pre-processed dataOnce the two Dataflow jobs are completed, make sure you have the data available in Google Cloud Storage. The data will be available in the location specified by the output flag in the code here. It’s this pre-processed data which we will use for training our ML model on Cloud ML Engine in the following steps.
Model training using Cloud ML EngineCloud Machine Learning Engine is a managed service that enables you to easily build machine learning models that work on any type of data and size. We will now take our TensorFlow code, as described in the model building approach section, and train the model in the cloud using Cloud ML Engine. Take a look at the code on GitHub.
Hyperparameter tuningHyperparameter tuning is a critical aspect of model architecture design, and Cloud ML Engine provides out of the box support for it as a service called Hypertune. The sophistication of Hypertune, compared to traditional methods of hyperparameter tuning, makes it particularly helpful. We will use the Hypertune service to determine the model’s learning rate. In this particular experiment setup, we ran 10 experiments with Hypertune to discover the ideal learning rate. You can check out the output of the Hypertune job on cloud console and identify the most optimized run. Take a look at the code on GitHub.
TensorBoardYou can inspect the training behavior using a powerful tool called TensorBoard. It allows you to visualize the graph and plot quantitative metrics about the execution of the graph. Once the training process is complete, you can point TensorBoard to the summary logs produced during training as described on GitHub. Check out the images below from TensorBoard to see the network performance based on 20% of the 1000 Genomes data held out for evaluation.
Prediction using the trained modelIf you simply want to create the prediction service and try it out, perform the following steps in your project and make sure you’ve requested increased quota as mentioned above.
BINARIES=<YOUR CLOUD STORAGE LOCATION WHERE YOU COPIED THE MODEL TO>
gcloud ml-engine models create $MODEL --regions us-central1
gcloud ml-engine versions create v1 --model $MODEL --origin $BINARIES --runtime-version 1.2
from google.datalab.ml import *
Figure 7: Confusion matrix
The confusion matrix above has 10 samples that don’t match their target and can be seen in the table below. These predictions are for sub populations at the boundaries of the super population groupings. For more information, please see Simons Genome Diversity Project publication.
Table 2: Mismatched samples with super population probabilities