Introducing Cloud Text-to-Speech powered by DeepMind WaveNet technology
Product Manager, Speech
Many Google products (e.g., the Google Assistant, Search, Maps) come with built-in high-quality text-to-speech synthesis that produces natural sounding speech. Developers have been telling us they’d like to add text-to-speech to their own applications, so today we’re bringing this technology to Google Cloud Platform with Cloud Text-to-Speech.
You can use Cloud Text-to-Speech in a variety of ways, for example:
- To power voice response systems for call centers (IVRs) and enabling real-time natural language conversations
- To enable IoT devices (e.g., TVs, cars, robots) to talk back to you
- To convert text-based media (e.g., news articles, books) into spoken format (e.g., podcast or audiobook)
Rolling in the DeepMindIn addition, we're excited to announce that Cloud Text-to-Speech also includes a selection of high-fidelity voices built using WaveNet, a generative model for raw audio created by DeepMind. WaveNet synthesizes more natural-sounding speech and, on average, produces speech audio that people prefer over other text-to-speech technologies.
In late 2016, DeepMind introduced the first version of WaveNet — a neural network trained with a large volume of speech samples that's able to create raw audio waveforms from scratch. During training, the network extracts the underlying structure of the speech, for example which tones follow one another and what shape a realistic speech waveform should have. When given text input, the trained WaveNet model generates the corresponding speech waveforms, one sample at a time, achieving higher accuracy than alternative approaches.
Fast forward to today, and we're now using an updated version of WaveNet that runs on Google’s Cloud TPU infrastructure.The new, improved WaveNet model generates raw waveforms 1,000 times faster than the original model, and can generate one second of speech in just 50 milliseconds. In fact, the model is not just quicker, but also higher-fidelity, capable of creating waveforms with 24,000 samples a second. We’ve also increased the resolution of each sample from 8 bits to 16 bits, producing higher quality audio for a more human sound.
With these adjustments, the new WaveNet model produces more natural sounding speech. In tests, people gave the new US English WaveNet voices an average mean-opinion-score (MOS) of 4.1 on a scale of 1-5 — over 20% better than for standard voices and reducing the gap with human speech by over 70%. As WaveNet voices also require less recorded audio input to produce high quality models, we expect to continue to improve both the variety as well as quality of the WaveNet voices available to Cloud customers in the coming months.
Cloud Text-to-Speech is already helping multiple customers deliver a better experience to their end users. Customers include Cisco and Dolphin ONE.
“As the leading provider of collaboration solutions, Cisco has a long history of bringing the latest technology advances into the enterprise. Google’s Cloud Text-to-Speech has enabled us to achieve the natural sound quality that our customers desire.
Tim Tuttle, CTO of Cognitive Collaboration, Cisco
As the leading provider of collaboration solutions, Cisco has a long history of bringing the latest technology advances into the enterprise. Google’s Cloud Text-to-Speech has enabled us to achieve the natural sound quality that our customers desire.
Jason Berryman, Dolphin ONE