AI & Machine Learning

Bringing the power of large models to Google Cloud’s Speech API

May 19, 2023
Calum Barnes

Product Manager, Cloud Speech

As voice becomes an increasingly popular touchpoint between businesses and customers, our Speech-to-Text (STT) API has been one of the fastest-growing APIs from Google Cloud. Google Cloud’s Speech API processes more than 1 billion voice minutes per month for our enterprise customers across a range of industries, with near-human levels of understanding for many commonly spoken languages.

Many companies are using speech services from Google Cloud to power next-generation products and customer experiences. HubSpot uses STT for its Conversational Intelligence tools, MRV uses the API to reduce customer service time by a third, and Spotify uses STT for its voice interface, Car Thing.

Our goal is to provide users with the highest possible quality speech recognition for their use case. At Google Cloud, we continue to partner with our colleagues in Google Research and beyond to push quality forward and develop new types of models. Today, that means we’re bringing the power of large models to our Speech API and into the hands of developers.

In March of this year, Google published research on progress towards a Universal Speech Model. Last week at Google I/O, we announced that we are bringing a new version of the Universal Speech Model, Chirp, to Cloud. Chirp will serve as a foundation model for Speech AI in Google Cloud. Today we are excited to dive deeper into how we are now applying the power of large models to our Speech API with Chirp.  

Chirp is Google Cloud's 2B-parameter speech model, built via self-supervised training on millions of hours of audio and 28 billion sentences of text spanning 100+ languages. Chirp delivers 98% speech recognition accuracy in English and over 300% relative improvement in several languages with fewer than 10 million speakers.

Chirp is not only larger than previous speech models, but also incorporates new training approaches. Chirp’s encoder was first trained on millions of hours of unsupervised (i.e., unlabeled) audio data from 100+ languages. The model was then fine-tuned for transcription in each specific language with small amounts of supervised data. This contrasts with traditional speech recognition techniques, which depend on large amounts of language-specific supervised data. This two-stage approach is what lets Chirp achieve such large quality improvements in languages and accents that have very few speakers and little labeled training data. By adding Chirp to Cloud, we are thrilled to bring the quality of speech recognition for more languages and accents closer to that of the most widely spoken languages.

In collaboration with the Internet Archive's TV News Archive, the GDELT Project is applying Google Cloud’s Speech-to-Text and Translation APIs to transcribe and translate television news from across the world, enabling researchers and journalists to understand and cite local events from local sources across a wide range of languages and dialects.

“Television news is a major source of information for societies around the world, but the lack of searchable and translatable transcripts has largely rendered it inaccessible. Through the combination of Speech-to-Text and Translation AI from Google Cloud, GDELT to date has transcribed and translated more than 66,000 broadcasts totaling more than 328 million words. With the release of Google's new Chirp speech model, we are now able to improve the accuracy of those transcriptions and dramatically expand the set of languages we can explore, greatly expanding our reach across the world,” said Kalev Leetaru, Founder of the GDELT Project.

We are excited to see how other companies will use Chirp to enable new Speech AI use cases across a variety of languages. Chirp is available now, in Preview, in the Speech-to-Text API. See our documentation and get started with the Speech-to-Text console today. 
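For readers who want a feel for what a request might look like before opening the documentation, here is a minimal sketch in Python that only assembles the URL and JSON body for a synchronous recognition call. It assumes the v2 REST surface of the Speech-to-Text API, a regional endpoint, the default recognizer `_`, and the model identifier `chirp`; none of these details are stated in this post, so please confirm them (along with supported regions and languages) against the current documentation. The function name and the project ID are hypothetical, and the sketch deliberately stops short of authentication and sending the request.

```python
import base64
import json


def build_chirp_recognize_request(project_id, location, audio_bytes,
                                  language_code="en-US"):
    """Assemble (url, json_body) for a recognize call that selects Chirp.

    Assumptions (not taken from the blog post): the v2 REST path layout,
    the default recognizer "_", and the model name "chirp".
    """
    url = (
        f"https://{location}-speech.googleapis.com/v2/"
        f"projects/{project_id}/locations/{location}/recognizers/_:recognize"
    )
    body = {
        "config": {
            # Let the service detect the audio encoding.
            "autoDecodingConfig": {},
            "languageCodes": [language_code],
            # Assumed identifier for the Chirp model.
            "model": "chirp",
        },
        # Inline audio is sent as base64-encoded bytes.
        "content": base64.b64encode(audio_bytes).decode("ascii"),
    }
    return url, json.dumps(body)


if __name__ == "__main__":
    # "my-project" is a placeholder; the bytes stand in for real audio.
    url, body = build_chirp_recognize_request(
        "my-project", "us-central1", b"\x00\x01\x02\x03"
    )
    print(url)
    print(body)
```

Keeping request construction separate from sending makes it easy to inspect the payload, or to swap in the official client library once you have credentials set up.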

We will continue investing in our pre-trained Speech API to make it even stronger and to help developers leverage the power of voice for their businesses, programs, and applications.
