Color: Enabling scientists to do real-time genomic data exploration

About Color

Color, founded in 2013, is a population health service powered by genomics. Color's clinical-grade sequencing, software, and analytics platform makes it easy and affordable for people around the world to access their genetic information to learn about their hereditary risk for cancer and heart disease, and how their genes may influence how they process certain medications.

Industries: Healthcare, Life Sciences
Location: United States

Google Cloud Platform enables Color scientists to stay focused on their work by analyzing huge volumes of genomic data in seconds rather than hours.

Google Cloud Results

  • Loads massive amounts of genomic data automatically
  • Helps data scientists stay focused on important work
  • Frees engineers from managing and scaling database infrastructure

Queries billions of rows of data in seconds

Color provides individuals with an assessment of their risk for common hereditary cancers and hereditary heart conditions, as well as insights about how their genes may influence how they process certain medications. Priced at $249, the Color test is physician-ordered and within financial reach of many people, especially compared with the thousands of dollars that DNA tests could cost. The process can be conveniently started at home using a saliva collection kit.

The results the testing provides can be potentially lifesaving when considered by a physician as part of a personal screening and prevention plan.

One example is Chris, a retired teacher who had uterine cancer in her family history. Though she showed no signs of cancer, her gynecologist ordered the Color test as a precautionary measure and her results showed she was at higher risk for certain kinds of hereditary cancer. Based on those results, Chris chose to have prophylactic surgery, which revealed she had early-stage cancer – a condition that without the surgery might not have been detected until it had reached stage 3. Chris credits her doctor and Color for saving her life.

Achieving those powerful insights from raw genomic data requires marrying huge computing power with sophisticated algorithms and genetic analysis. Color had been using a simple Postgres open-source database for customer phenotype, health history, and other reports, as well as tools such as open-source Metabase to gain insights from the de-identified data. But Color engineers didn't have a satisfactory solution for heavy data mining of genetic variants, which was crucial for expanding its tests, understanding clients' DNA better, and delivering more insights.

To meet these challenges, Color turned to Google Cloud. "Google sees its job as identifying difficult technical problems and then helping organizations overcome them, so they can get things done and break new ground," says Ryan Barrett, Software Engineer for Color. "Google is unlike other cloud vendors in that regard, and that was very important to us."

Google Cloud Platform (GCP) services, especially BigQuery, offered the solution that Color engineers sought. "BigQuery is the only data warehouse we've seen with a query engine that's robust and mature enough to easily import billions of rows of data and then allow us to analyze the data in seconds," says Ryan.

Along with BigQuery, Color now leverages Variant Transforms to load massive amounts of de-identified genomic data directly into BigQuery; Cloud Storage for storing de-identified data ingested from Variant Transforms; and Cloud Dataproc to process streaming and batch de-identified data in combination with BigQuery and to automatically manage compute resources needed to support ad-hoc data analysis in BigQuery.

Saving time by not having to manually scale

BigQuery is a fully managed cloud data warehouse and analytics engine that easily scales to accommodate the growing volume of genomic data Color uses to provide insights to its customers.

"We have billions of rows of data today and we're looking at growing to trillions," Ryan says. "Scaling manually like this to increase querying speed and accommodate larger data sets would be a full-time job for at least one, maybe two, engineers. But BigQuery handles all that for us and much more: speeds up our product development; improves our academic research; and frees up team members to focus on more important development tasks. BigQuery was the only cloud data warehouse we looked at that didn't require us to handle the scaling ourselves."

BigQuery is also behind a new public database, Color Data, that Color unveiled earlier this year. Color Data is the largest public database of de-identified, aggregated clinical and genetic information, culled from 50,000 Color test users who have consented to be included. The interactive database enables users to query BigQuery in real time with a variety of filters.
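As a rough illustration of how a set of user-selected filters might become a BigQuery query, here is a minimal Python sketch. The table name, the column names (gene, sex, ancestry), and the build_filtered_query helper are all hypothetical; none of them comes from Color Data. The only BigQuery-specific detail relied on is the @name syntax for named query parameters, which keeps user input out of the SQL text.

```python
# Hypothetical filter columns; Color Data's real schema is not described in the source.
ALLOWED_COLUMNS = {"gene", "sex", "ancestry"}


def build_filtered_query(table, filters):
    """Return (sql, params) for an aggregate query restricted by simple equality filters.

    `filters` maps a column name to the value the user selected.
    Values are never interpolated into the SQL; they are returned separately
    so a BigQuery client can bind them as named parameters (@gene, @sex, ...).
    """
    unknown = set(filters) - ALLOWED_COLUMNS
    if unknown:
        raise ValueError(f"unsupported filter columns: {sorted(unknown)}")

    columns = sorted(filters)  # stable order makes the query text deterministic
    clauses = [f"{col} = @{col}" for col in columns]
    where = " AND ".join(clauses) if clauses else "TRUE"

    sql = (
        f"SELECT gene, COUNT(*) AS n FROM `{table}` "
        f"WHERE {where} GROUP BY gene ORDER BY n DESC"
    )
    params = {col: filters[col] for col in columns}
    return sql, params


if __name__ == "__main__":
    # Made-up table name and filter values, for illustration only.
    sql, params = build_filtered_query(
        "example-project.example_dataset.variants",
        {"sex": "F", "gene": "BRCA1"},
    )
    print(sql)
    print(params)
```

Restricting filters to a fixed set of columns and passing the values as parameters is one common way to let an interactive page issue ad-hoc queries safely; the source does not say how Color Data itself does this.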

"BigQuery has it all — an easy-to-use interface, an ecosystem of APIs, the ability to scale to support our data, and the tools we need to query data in seconds," Ryan adds.

Automatically converting data into genomic file formats

Variant Transforms is an open source tool developed by the Google Cloud Healthcare & Life Sciences team. It makes it easier for bioinformaticians to analyze genomic data with tools like BigQuery by automating the extraction and transformation of variants from VCF files and loading them directly into BigQuery. Color leverages Variant Transforms in its bioinformatics pipeline to ingest massive amounts of genomic data directly into BigQuery. "Initially, I had expected to convert and transform the data myself," Ryan says. "But Variant Transforms requires very little effort on our part to transform the data into the genomic data file formats needed to run analyses."
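To give a sense of the work that such a tool takes off an engineer's plate, the sketch below flattens a single VCF data line into a row-shaped dictionary of the kind a data warehouse table could hold. It is not Variant Transforms code and does not reproduce its schema. The vcf_line_to_row function, the output field names, and the choice to store a 0-based start position are assumptions made for illustration, and the sample record is made up.

```python
import json


def vcf_line_to_row(line, sample_names):
    """Flatten one tab-separated VCF data line into a JSON-serializable row.

    Assumes the standard VCF column order:
    CHROM POS ID REF ALT QUAL FILTER INFO [FORMAT sample1 sample2 ...]
    INFO is ignored here to keep the sketch short.
    """
    fields = line.rstrip("\n").split("\t")
    chrom, pos, vid, ref, alt, qual, flt = fields[:7]
    format_keys = fields[8].split(":") if len(fields) > 8 else []

    calls = []
    for name, raw in zip(sample_names, fields[9:]):
        values = dict(zip(format_keys, raw.split(":")))
        gt = values.get("GT", "")
        # Alleles are separated by "/" (unphased) or "|" (phased); "." means missing.
        alleles = [
            int(a) if a.isdigit() else -1
            for a in gt.replace("|", "/").split("/")
        ] if gt else []
        calls.append({"name": name, "genotype": alleles})

    start = int(pos) - 1  # VCF POS is 1-based; this sketch stores a 0-based start
    return {
        "reference_name": chrom,
        "start_position": start,
        "end_position": start + len(ref),
        "reference_bases": ref,
        "alternate_bases": [] if alt == "." else alt.split(","),
        "names": [] if vid == "." else vid.split(";"),
        "quality": None if qual == "." else float(qual),
        "filter": [] if flt == "." else flt.split(";"),
        "call": calls,
    }


if __name__ == "__main__":
    # Made-up record with two hypothetical samples.
    example = "chr1\t100\t.\tG\tA,T\t50\tPASS\tDP=10\tGT:DP\t0/1:12\t1|2:8"
    print(json.dumps(vcf_line_to_row(example, ["sample_a", "sample_b"]), indent=2))
```

Even this simplified version has to handle multiple alternate alleles, missing values, and per-sample genotype fields, which hints at why hand-written conversion across billions of rows is something a team would prefer to hand off.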

Color engineers began working with Variant Transforms as soon as it became available. "Variant Transforms did exactly what I needed it to do," Ryan says. "It imported our complete, large dataset into BigQuery with ease."

Google engineers saw that Color was an early user of Variant Transforms and jumped in to help. "There are always growing pains, especially with new software," Ryan adds. "Google was hugely supportive from the beginning, especially the healthcare and life sciences team, who went above and beyond to help us use Variant Transforms. They clearly care and truly make the effort to make sure everything is working for us."

Enabling real-time exploration

Cloud Dataproc processes streaming and batch data for Color and works in combination with BigQuery, automatically managing the compute resources needed to support ad-hoc data analysis in BigQuery.

"By pointing Cloud Dataproc at BigQuery, we enable our scientists to do real-time data exploration," Ryan explains. "From massive data volumes, our scientists can ask questions, hunt for patterns, and explore potential outcomes within seconds or minutes — tasks that would otherwise take up to 12 hours. That's a hard, technical problem to solve, but GCP solves it without us having to worry about managing everything or writing extra code."

Ryan concludes: "When you have to wait for results from queries, it's easy to lose your focus and get distracted by emails or other activities. Real-time exploration keeps scientists in a continual flow state, focused on achieving insights — a powerful shift in how we work."
