Cloud Speech-to-Text

Speech-to-text conversion powered by machine learning, available for short or long-form audio

Powerful Speech Recognition

Google Cloud Speech-to-Text enables developers to convert audio to text by applying powerful neural network models in an easy-to-use API. The API recognizes 120 languages and variants to support your global user base. You can enable voice command-and-control, transcribe audio from call centers, and more. It can process real-time streaming or pre-recorded audio using Google’s machine learning technology.


Powered by Machine Learning

Apply the most advanced deep learning neural network algorithms to audio for speech recognition with unparalleled accuracy. Cloud Speech-to-Text’s accuracy improves over time as Google improves the internal speech recognition technology used by Google products.

Recognizes 120 Languages and Variants

Cloud Speech-to-Text recognizes 120 languages and variants to support your global user base. You can also filter inappropriate content in text results for some languages.

Returns Text Transcription in Real Time for Short or Long-form Audio

Cloud Speech-to-Text can stream text results, immediately returning text as it’s recognized from streaming audio or as the user is speaking. Alternatively, Cloud Speech-to-Text can return recognized text from audio stored in a file. It’s capable of analyzing short and long-form audio.

Automatically Transcribes Proper Nouns and Context-Specific Formatting

Cloud Speech-to-Text is tailored to work well with real-life speech and can accurately transcribe proper nouns (e.g., Sundar Pichai) and appropriately format language (e.g., dates, phone numbers). Google supports more than 10 times as many proper nouns as there are words in the entire Oxford English Dictionary.

Offers a Selection of Pre-Built Models, Tailored for Your Use Case

Cloud Speech-to-Text comes with multiple pre-built speech recognition models so you can optimize for your use case (e.g., voice commands). For example, our pre-built video transcription model is ideal for indexing or subtitling video and multispeaker content, and uses machine learning technology similar to YouTube captioning.

Model               Description
command_and_search  Best for short queries such as voice commands or voice search.
phone_call          Best for audio that originated from a phone call (typically recorded at an 8 kHz sampling rate).
video               Best for audio that originated from video or includes multiple speakers. Ideally the audio is recorded at a 16 kHz or greater sampling rate. This is a premium model that costs more than the standard rate.
default             Best for audio that is not one of the specific audio models. For example, long-form audio. Ideally the audio is high-fidelity, recorded at a 16 kHz or greater sampling rate.
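
The guidance in the table can be sketched as a small helper. The model names come from the table; the function names, the `source` labels, and the selection rules are hypothetical and only illustrate one way to apply the table, not a documented part of the API.

```python
# Sampling rate (Hz) the table recommends for the "video" and "default" models.
RECOMMENDED_MIN_RATE_HZ = 16000


def choose_model(source: str, short_query: bool = False) -> str:
    """Map a coarse description of the audio to a pre-built model name.

    `source` is a hypothetical label ("phone", "video", or anything else).
    """
    if short_query:
        return "command_and_search"   # voice commands or voice search
    if source == "phone":
        return "phone_call"           # audio that originated from a phone call
    if source == "video":
        return "video"                # video or multi-speaker audio (premium)
    return "default"                  # everything else, e.g. long-form audio


def meets_recommended_rate(model: str, sample_rate_hz: int) -> bool:
    """Check the 16 kHz recommendation the table gives for video and default."""
    if model in ("video", "default"):
        return sample_rate_hz >= RECOMMENDED_MIN_RATE_HZ
    # The table states no minimum for the other models.
    return True


if __name__ == "__main__":
    model = choose_model("video")
    print(model, meets_recommended_rate(model, 44100))
```

Because the video model is billed at a premium rate, a caller may prefer to fall back to `default` when cost matters more than accuracy on multi-speaker audio.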

Cloud Speech-to-Text Features

Speech-to-text conversion powered by machine learning

Automatic Speech Recognition
Automatic speech recognition (ASR), powered by deep learning neural networks, for applications such as voice search or speech transcription.
Global Vocabulary
Recognizes 120 languages and variants with an extensive vocabulary.
Word Hints
Speech recognition can be customized to a specific context by providing a set of words and phrases that are likely to be spoken. Especially useful for adding custom words and names to the vocabulary and in voice-control use cases.
Real-time Streaming or Pre-recorded Audio Support
Audio input can be streamed from an application’s microphone or sent from a pre-recorded audio file (inline or through Google Cloud Storage). Multiple audio encodings are supported, including FLAC, AMR, PCMU and Linear-16.
Noise Robustness
Handles noisy audio from many environments without requiring additional noise cancellation.
Inappropriate Content Filtering
Filter inappropriate content in text results for some languages.
Automatic Punctuation
Accurately punctuates transcriptions (e.g., commas, question marks, and periods) with machine learning.
Model Selection
Choose from a selection of four pre-built models: default, voice commands and search, phone calls, and video transcription.
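
Several of the features above (word hints, content filtering, automatic punctuation, model selection, inline or Cloud Storage audio) are set per request. The sketch below builds a JSON request body using only the Python standard library. The field names (`speechContexts`, `profanityFilter`, `enableAutomaticPunctuation`, `model`, `content`, `uri`) reflect our understanding of the v1 REST recognize request and should be checked against the API reference; the function name, its defaults, and the bucket path are hypothetical. No request is sent.

```python
import base64
import json
from typing import Any, Dict, List, Optional


def build_recognize_request(
    audio_bytes: Optional[bytes] = None,
    gcs_uri: Optional[str] = None,
    language_code: str = "en-US",
    model: str = "default",
    phrases: Optional[List[str]] = None,
    profanity_filter: bool = False,
    punctuation: bool = True,
    encoding: str = "FLAC",
    sample_rate_hertz: int = 16000,
) -> Dict[str, Any]:
    """Assemble a request body; field names are assumed, not verified here."""
    # Audio is either inline bytes or a Cloud Storage URI, never both.
    if (audio_bytes is None) == (gcs_uri is None):
        raise ValueError("provide exactly one of audio_bytes or gcs_uri")

    config: Dict[str, Any] = {
        "encoding": encoding,
        "sampleRateHertz": sample_rate_hertz,
        "languageCode": language_code,
        "model": model,                              # model selection
        "profanityFilter": profanity_filter,         # content filtering
        "enableAutomaticPunctuation": punctuation,   # automatic punctuation
    }
    if phrases:
        # Word hints: words and phrases likely to be spoken.
        config["speechContexts"] = [{"phrases": list(phrases)}]

    if gcs_uri is not None:
        audio = {"uri": gcs_uri}
    else:
        # Inline audio is base64-encoded so it can travel inside JSON.
        audio = {"content": base64.b64encode(audio_bytes).decode("ascii")}

    return {"config": config, "audio": audio}


if __name__ == "__main__":
    body = build_recognize_request(
        gcs_uri="gs://example-bucket/meeting.flac",  # hypothetical path
        model="video",
        phrases=["Sundar Pichai"],
    )
    print(json.dumps(body, indent=2))
```

Keeping the body as a plain dictionary makes it easy to inspect or log before handing it to whichever HTTP client or client library the application already uses.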

Cloud Speech-to-Text API Pricing

Cloud Speech-to-Text is priced per 15 seconds of audio processed after a 60-minute free tier. For details, please see our pricing guide.

Feature                                       0-60 minutes  Over 60 minutes, up to 1 million minutes
Speech Recognition (all models except video)  Free          $0.006 USD / 15 seconds*
Video Speech Recognition                      $0.006        $0.012 USD / 15 seconds*

Note: The video speech recognition model is available at an introductory trial price of $0.006 per 15 seconds through May 31, 2018.

If you pay in a currency other than USD, the prices listed in your currency on Cloud Platform SKUs apply.

This pricing is for applications on personal systems (e.g., phones, tablets, laptops, desktops). Please contact us for approval and pricing to use the Speech-to-Text API on embedded devices (e.g., cars, TVs, appliances, or speakers).

* Each request is rounded up to the next increment of 15 seconds. For example, if you make three separate requests, each containing 7 seconds of audio, you are billed $0.018 USD for 45 seconds (3 × 15 seconds) of audio. Fractions of a second count when rounding up: 15.14 seconds is rounded up and billed as 30 seconds.
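
The rounding rule can be checked with a few lines of Python. This is a minimal sketch of the arithmetic in the footnote only: the constant and function names are hypothetical, the rates are the listed standard and video prices per 15 seconds, and the 60-minute free tier and the introductory video price are deliberately ignored.

```python
import math
from decimal import Decimal
from typing import Iterable

INCREMENT_SECONDS = 15
RATE_STANDARD = Decimal("0.006")  # USD per 15 seconds, all models except video
RATE_VIDEO = Decimal("0.012")     # USD per 15 seconds, video model


def billed_seconds(duration_seconds: float) -> int:
    """Round one request up to the next 15-second increment."""
    return math.ceil(duration_seconds / INCREMENT_SECONDS) * INCREMENT_SECONDS


def cost_usd(durations: Iterable[float], rate: Decimal = RATE_STANDARD) -> Decimal:
    """Total cost when each request is rounded up separately."""
    increments = sum(billed_seconds(d) // INCREMENT_SECONDS for d in durations)
    return rate * increments


if __name__ == "__main__":
    print(billed_seconds(15.14))   # 30
    print(cost_usd([7, 7, 7]))     # 0.018
```

Because rounding is applied per request, three 7-second requests cost three increments, while a single 21-second request would cost two. Batching short clips into fewer requests can therefore lower the bill.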