Harnessing big genomic data to find the missing answers of Autism Spectrum Disorder

Scaling to Thousands of Genomes

As the MSSNG Project team developed its roadmap to sequence the whole genomes of thousands of individuals, it quickly realized that the data generated by a project of this size would exceed the capacity and capabilities of its usual partners. At 100 to 200 gigabytes per raw genome, the MSSNG Project could easily surpass a petabyte of data.

“To manage this scale, we had to reach beyond academia and the life sciences. We had to forge a new collaboration with experts in storing, analyzing, and providing access to big data,” said Dr. Robert Ring, Chief Science Officer of Autism Speaks. “Connecting biological discoveries with Google expertise in extracting value from huge amounts of information will advance not only autism research, but the entire field of genomic medicine.”

Working through Google Genomics, the MSSNG Project has access to the same technologies that power Google Search and Maps. Using these technologies, the MSSNG team is creating solutions for securely storing, processing, exploring, and sharing complex biological datasets. Autism Speaks has already uploaded nearly 100 terabytes of data from more than 1,300 genomes to Google Cloud Storage and has an additional 2,000 samples in the sequencing queue. In the end, the MSSNG database will hold information from the whole genomes of 10,000 individuals, making it the world’s largest single repository of autism-related DNA sequencing data.
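
The petabyte figure follows from the numbers quoted in this section. As a rough back-of-envelope sketch (not a calculation published by the project), the snippet below multiplies the 10,000-genome target by the quoted 100 to 200 gigabytes per raw genome. The function name and the use of decimal units (1 petabyte = 1,000,000 gigabytes) are assumptions made for illustration.

```python
# Back-of-envelope storage estimate using figures quoted in the article.
# Assumption: decimal units, i.e. 1 PB = 1,000,000 GB.
GB_PER_PB = 1_000_000


def total_petabytes(genomes: int, gb_per_genome: float) -> float:
    """Return the total raw data volume in petabytes (hypothetical helper)."""
    return genomes * gb_per_genome / GB_PER_PB


TARGET_GENOMES = 10_000  # planned size of the MSSNG database

low = total_petabytes(TARGET_GENOMES, 100)   # lower bound: 100 GB per genome
high = total_petabytes(TARGET_GENOMES, 200)  # upper bound: 200 GB per genome

print(f"Estimated raw data: {low:.1f} to {high:.1f} PB")
```

Under these assumptions the raw data for the full cohort lands between one and two petabytes, consistent with the statement that the project could easily surpass a petabyte.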

Enabling Open Science

An important part of the MSSNG Project is sharing these data with the global autism research community. Until now, moving genomic data between collaborators meant physically shipping hard drives, a costly and time-intensive process. The MSSNG database gives the autism research community web-based access to genomic data from thousands of individuals, together with new online analysis tools, so that research projects can start immediately.

In January 2015, Nature Medicine published results from a MSSNG Project-led study that offered new insights into the diversity of autism. The study, the largest autism genome study of its kind, found that the disorder’s genetic underpinnings are even more complex than previously thought: most siblings who have autism spectrum disorder have different autism-linked genes. The study’s de-identified data has been uploaded to the Google Cloud Platform and is being made available to scientists for global research.

“I am immensely excited because for the first time, any scientist anywhere in the world will be able to collaborate and perform analyses with these data in a ‘common cloud’,” said Dr. Stephen Scherer, Ph.D., D.Sc., FRSC, MSSNG Program Director. “Thanks to Google Cloud Platform and our work with the Google Genomics team, this vast sea of information will be made accessible for free to researchers everywhere. This is an exemplar for a future when open-access genomics will lead to personalized treatments for many developmental and medical disorders.”

The MSSNG portal, which is built on Google Cloud Platform and Google Genomics, will allow qualified researchers to access the sequencing data using any modern web browser. Once logged in, researchers and bioinformaticians can query the data using tools such as BigQuery or use the Google Genomics API for batch analysis pipelines. Using Google Cloud Platform, the MSSNG Project will be prepared to manage the ever-changing query workloads of this global project.

Finding the Missing Answers in Autism

Over the last five years, scientists have identified a number of rare gene changes, or mutations, associated with autism. A small number of these are sufficient to cause autism by themselves. Most cases of autism, however, appear to be caused by a combination of autism risk genes and environmental factors influencing early brain development. Researchers will use the MSSNG Project data to study and help answer some of the most vexing questions about the genotype-phenotype relationships in autism.

Each individual genome sequenced and stored in the MSSNG database will be associated with an array of detailed clinical information about the donor, which has been collected in a standardized way. This clinical information includes diagnoses and a rich diversity of related medical and research information. By combining it with DNA sequencing data, researchers will be able to ask better questions and get faster answers about how genetic mutations lead to the development of autism and its many associated medical conditions.

“The insight and expertise the Google team has brought to the table in terms of innovative new ways to look at datasets this large has been unmatched,” said Dr. Ring. “Together, we hold the capability of accelerating breakthroughs in understanding the causes and subtypes of autism in ways that can advance diagnosis and treatment as never before. This is an incredibly important moment in autism genomic discovery, and we are poised to write the next chapter together.”