Public Datasets

Better data for better AI: new speech datasets and benchmarks for data

December 14, 2021

https://storage.googleapis.com/gweb-cloudblog-publish/images/18_-_Infrastructure_urFj7Af.max-2600x2600.jpg

Peter Mattson

Staff Engineer

AI researchers and engineers need better data to enable better AI solutions. The quality of an AI solution is determined by both the learning algorithm (such as a deep-neural network model) and the datasets used to train and evaluate that algorithm. Historically, AI research has focused much more on algorithms than datasets, despite their vital importance. As a result, many algorithms are freely available as starting points, but many important problems lack large, high-quality open datasets. Further, creating new datasets is expensive and error-prone.

Recently, the data-centric AI movement has emerged, which aims to develop new methodologies and tools for constructing better datasets to fix this problem. Conferences, workshops, challenges, and platforms are being launched to support improving data quality and to foster data excellence. Thought leaders such as Andrew Ng at Landing.AI and Chris Re at Stanford University are encouraging AI developers to focus more on iterative data engineering than they do tuning their learning algorithms. Our CHI-best-paper-award-winning paper, “Everyone wants to do the model work, not the data work” highlighted the significance of data quality in the practice of ML.

At Google, we are excited to contribute to data-centric AI. Today, Google Cloud is adding a new high value dataset to the Public Dataset Program, and Google researchers are announcing DataPerf, a new multi-organizational effort to develop benchmarks for data quality and data centric algorithms.

Google Cloud is committed to helping users improve their data quality, starting with supporting better public data. The Public Datasets program provides high quality datasets pre-configured on GCP for easy access. Google Cloud is adding a new high-value dataset developed by the MLCommons™ Association (which Google co-founded) to the Public Datasets program: The Multilingual Spoken Words Corpus: a rich audio speech dataset with more than 340,000 keywords in 50 languages with upwards of 23.4 million examples.

This new public dataset is aligned with the MLCommons Association vision for “open” datasets – accessible by all – that are “living” – continually being improved to raise quality and increase representation and diversity.

Google researchers, in collaboration with multiple organizations, are announcing the DataPerf effort at the NeurIPS Data-Centric AI workshop today, to develop benchmarks to improve data quality. Much like the the MLPerf™ benchmarking effort which is now the industry standard for machine learning hardware/software speed, DataPerf brings together the originators of prior efforts including: CATS4ML, Data-Centric AI Competition, DCBench, Dynabench, and the MLPerf benchmarks to define clear metrics that catalyze rapid innovation. DataPerf will measure the utility of training and test data for common problems, and algorithms for working with datasets such as: selecting core sets, correcting errors, identifying under-optimized data slices, and valuing datasets prior to labeling.

Together, supporting open, living datasets for core ML tasks, and the development of benchmarks to direct the rapid evolution of those datasets will empower the researchers and engineers who use Google Cloud to do even more amazing things – and we can’t wait to see what they create!

Acknowledgements: In collaboration with Lora Aroyo and Praveen Paritosh.

Data Analytics