Analyzing 3024 rice genomes characterized by DeepVariant
Developer Advocate, Digital Assets at Google Cloud
Research Engineer, Google Brain
Rice is an ideal candidate for study in genomics, not only because it’s one of the world’s most important food crops, but also because centuries of agricultural cross-breeding have created unique, geographically-induced differences. With the potential for global population growth and climate change to impact crop yields, the study of this genome has important social considerations.
This post explores how to identify and analyze different rice genome mutations with a tool called DeepVariant. To do this, we performed a re-analysis of the Rice 3K dataset and have made the data publicly available as part of the Google Cloud Public Dataset Program pre-publication and under the terms of the Toronto Statement.
We aim to show how AI can improve food security by accelerating genetic enhancement to increase rice crop yield. According to the Food and Agriculture Organization of the United Nations, crop improvements will reduce the negative impact of climate change and loss of arable land on rice yields, as well as support an estimated 25% increase in rice demand by 2030.
Why catalog genetic variation for rice on Google Cloud?
In March 2018, Google AI showed that deep convolutional neural networks can identify genetic variation in aligned DNA sequence data. This approach, called DeepVariant, outperforms existing methods on human data, and we showed that the approach to call variants on a human can be used to call variants on other animal species. This blog post demonstrates that DeepVariant is also effective at calling variants on a plant, thus demonstrating the effectiveness of deep neural network transfer learning in genomics.
In April 2018, three research institutions—the Chinese Academy of Agricultural Sciences (CAAS), the Beijing Genomics Institute (BGI) Shenzhen, and the International Rice Research Institute (IRRI)—published the results of a collaboration to sequence and characterize the genomic variation of the Rice 3K dataset, which consists of genomes from 3,024 varieties of rice from 89 countries. Variant calls used in this publication were identified against a Nipponbare reference genome using best practices and are available from the SNP-Seek database (Mansueto et al, 2017).
We recharacterized the genomic variation of the Rice 3K dataset with DeepVariant. Preliminary results indicate a larger number of variants discovered at a similar or lower error rate than those detected by conventional best practice, i.e. GATK.
In total the Rice3K DeepVariant dataset contains ~12 billion variants at ~74 million genomic locations (SNPs and Indels). These are available in a 1.5 terabyte (TB) table that uses the BigQuery Variants Schema.
Even at this size, you can still run interactive analyses, thanks to the scalable design of BigQuery. The queries we present below run on the order of a few seconds to a few minutes. Speed matters, because genomic data are often being interlinked with data generated by other precision agriculture technologies.
Illustrative queries and analyses
Below, we present some example queries and visualizations of how to query and analyze the Rice 3K dataset. Our analyses focus on two topics:
- The distribution of genome variant positions, across 3024 rice varieties.
- The distribution of allele frequencies across the rice genome.
For a step-by-step tutorial on how to work with variant data in BigQuery using the Rice 3K data or another variant dataset of your choosing, consider trying out the Analyzing variants with BigQuery codelab.
Analysis 1: Genetic variants are not uniformly distributed
Genomic locations with very high or very low levels of variation can indicate regions of the genome that are under unusually high or low selective pressure.
In the case of these rice varieties, high selective pressure (which corresponds to low genetic variation) indicates regions of the genome under high artificial selective pressure (i.e. domestication). Moreover, these regions contain genes responsible for traits that regulate important cultivational or nutritional properties of the plant.
We can measure the magnitude of the regional pressure by calculating at each position the Z statistic of each individual variety vs. all varieties. Here’s the query we used to produce the heatmap below, which shows the distribution of genetic variation across all 1Mbase-sized regions across all 12 chromosomes as columns (labeled by the top colored row), vs. all 3024 rice varieties as rows. Red indicates very low variant density relative to other samples within a particular genomic region, while pale yellow indicates very high variant density within a particular genomic region. The dendrogram below shows the similarity among samples (branch length) and groups similar rice varieties together:
A high resolution PDF of this plot is available, as well as the R script used to generate it.
Some interesting details of the dataset are highlighted (in yellow) in the heatmap above:
Closer inspection of chromosome 5 (cyan columns, 1Mbase blocks 9-12) shows that the distinct distribution of Z scores across samples likely occurs due to two factors:
this region includes many centromeric satellites resulting in a high false-positive rate of variants detected, and
a genomic introgression present in some of the rice varieties multiplies this effect (yellow rows).
Nearly all of the 3024 rice varieties included in the Rice 3K dataset are from rice species Oryza sativa. However, 5 Oryza glaberrima varieties were also included. These have a high level of detected genetic variation because they are from a different species, and are revealed as a bright yellow band at the top of the heatmap.
The majority of samples can be partitioned into one group with high variant density and another group with low variant density. This partition fits with previously used methods for classification by admixture. For example, the bottom rows that are mostly red correspond to rice varieties in the japonica and circum-basmati (aromatic) groups that are similar to the Nipponbare reference genome we used.
Analysis 2: Some specific regions are under selective pressure
According to the Hardy-Weinberg Principle, the expected proportion of genotype frequencies within a randomly mating population, in the absence of selective evolutionary pressure, can be calculated from the component allele frequencies. For a bi-allelic position having alleles P and Q and corresponding population frequencies p and q, the expected genotype proportions for PP, PQ, and QQ can be calculated with the formula p2 + 2pq + q2 = 1. However we need to modify this formula by adding an inbreeding coefficient F to reflect the population structure (see: Wahlund effect) and the self-pollination tendency of rice: PP=p2+Fpq ; PQ=2(1-F)pq ; QQ=q2+Fpq where F=0.95.
The significance of genomic positions deviating from the expected genotype distribution follows χ2 , allowing a p-value to be derived and thus identification of positions that are either under significant selective pressure or neutral. In short, this analysis, highlights the fact that rice is highly inbred.
Below you can find a plot of 10-kilobase genome regions from the Oryza sativa genome, colored according to the proportion of variant positions that are significantly (p<0.05) out of (inbreeding modified) Hardy-Weinberg equilibrium, with white regions corresponding to those under low selective pressure and red regions corresponding to those under high selective pressure:
The data shown above were retrieved using this query and plotted using this R script. The query used to make this figure was adapted to the BigQuery Variants Schema from one of a number of quality control metrics found in the Google Genomics Cookbook.
Note that selective pressure on the genome is not uniformly distributed, indicated by the clumps of red visible in the plot. Interestingly, there is little correspondence between the prevalence of variants within a region (previous figure) and the proportion of variants within that same region that are under significant selective pressure. The bin size (10 kilobases) used in this visualization is on the order of the average Oryza sativa gene size (3 kilobases) and, given the low correlation between high selective pressure and variant density, it may be useful to guide a gene hunting expedition aimed at identifying genomic loci associated with phenotypes of interest (i.e. those that affect caloric areal yield, nutritive value, and drought- and pest-resistance).
Data availability and conclusion
Genome sequencer reads in FastQ format from Sequence Read Archive Project PRJEB6180, were aligned to the Oryza sativa Os-Nipponbare-Reference-IRGSP-1.0 reference genome using the Burrow-Wheeler Aligner (BWA), producing a set of aligned read files in BAM format.
Subsequently, the BAM files were processed with the Cloud DeepVariant Pipeline, a Cloud TPU-enabled, managed service that executes the DeepVariant open-source software. The pipeline produced a list of variants detected in the aligned reads, and these variants were written out to storage as a set of variant call files in VCF format.
Finally, all VCF files were processed with the Variant Transforms Cloud Dataflow Pipeline, which wrote records to a BigQuery Public Dataset table in the BigQuery Variants Schema format.
For additional guidance on how to use DeepVariant and BigQuery to analyze your own data on Google Cloud, please check out the following resources:
We’d like to thank our collaborators and their organizations—both within and outside Google—for making this post possible:
Allen Day, Google Cloud
Ryan Poplin, Google AI
Ken McNally, IRRI
Dmytro Chebotarov, IRRI
Ramil Mauleon, IRRI