- Automatic Speech Recognition
- Automatic Speech Recognition (ASR), powered by deep learning neural networks, supports applications such as voice search and speech transcription.
- Global Vocabulary
- Recognizes 120 languages and variants with an extensive vocabulary.
- Word Hints
- Speech recognition can be customized to a specific context by providing a set of words and phrases that are likely to be spoken. This is especially useful for adding custom words and names to the vocabulary and for voice-control use cases.
- Real-time Streaming or Pre-recorded Audio Support
- Audio input can be streamed from an application’s microphone or sent from a pre-recorded audio file (inline or through Google Cloud Storage). Multiple audio encodings are supported, including FLAC, AMR, PCMU and Linear-16.
- Noise Robustness
- Handles noisy audio from many environments without requiring additional noise cancellation.
- Inappropriate Content Filtering
- Filter inappropriate content in text results for some languages.
- Automatic Punctuation
- Uses machine learning to punctuate transcriptions accurately (e.g., commas, question marks, and periods).
- Model Selection
- Choose from a selection of four pre-built models: default, voice commands and search, phone calls, and video transcription.
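Several of the features listed above (word hints, inline or Cloud Storage audio, content filtering, punctuation, and model selection) correspond to options on a recognition request. The following is a minimal sketch of how such a request body might be assembled. The helper `build_recognize_request` is hypothetical, and the JSON field names and model identifiers are assumptions based on the v1 REST API as commonly documented; check them against the current API reference before relying on them.

```python
import base64
import json

# Assumed identifiers for the four pre-built models; verify against the API reference.
MODELS = {"default", "command_and_search", "phone_call", "video"}


def build_recognize_request(language_code, *, audio_bytes=None, gcs_uri=None,
                            encoding="FLAC", sample_rate_hertz=16000,
                            phrases=(), model="default",
                            profanity_filter=False, punctuation=False):
    """Hypothetical helper: build a JSON-serializable recognition request body."""
    # Audio is either sent inline or referenced in Google Cloud Storage, never both.
    if (audio_bytes is None) == (gcs_uri is None):
        raise ValueError("provide exactly one of audio_bytes or gcs_uri")
    if model not in MODELS:
        raise ValueError(f"unknown model: {model!r}")

    config = {
        "encoding": encoding,                       # e.g. FLAC, LINEAR16, AMR, MULAW
        "sampleRateHertz": sample_rate_hertz,
        "languageCode": language_code,              # e.g. "en-US"
        "model": model,
        "profanityFilter": profanity_filter,        # inappropriate content filtering
        "enableAutomaticPunctuation": punctuation,  # automatic punctuation
    }
    if phrases:
        # Word hints: words and phrases that are likely to be spoken.
        config["speechContexts"] = [{"phrases": list(phrases)}]

    if gcs_uri is not None:
        audio = {"uri": gcs_uri}
    else:
        # Inline audio is base64-encoded so it can travel inside JSON.
        audio = {"content": base64.b64encode(audio_bytes).decode("ascii")}

    return {"config": config, "audio": audio}


if __name__ == "__main__":
    body = build_recognize_request(
        "en-US",
        gcs_uri="gs://example-bucket/meeting.flac",  # placeholder URI
        phrases=["Kubernetes", "Anthos"],
        model="video",
        punctuation=True,
    )
    print(json.dumps(body, indent=2))
```

The sketch only builds the request body; sending it, authenticating, and handling streaming input are out of scope here.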
|Feature|0-60 minutes|Over 60 minutes, up to 1 million minutes|
|---|---|---|
|Speech Recognition (all models except video)|Free|$0.006 USD / 15 seconds*|
|Video Speech Recognition|$0.006|$0.012 USD / 15 seconds*|
Note: The video speech recognition model is available for an introductory trial price of $0.006 per 15 seconds through May 31, 2018. If you pay in a currency other than USD, the prices listed in your currency on Cloud Platform SKUs apply.
This pricing is for applications on personal systems (e.g., phones, tablets, laptops, desktops). Please contact us for approval and pricing to use the Speech-to-Text API on embedded devices (e.g., cars, TVs, appliances, or speakers).
* Each request is rounded up to the nearest increment of 15 seconds. For example, if you make three separate requests, each containing 7 seconds of audio, you are billed $0.018 USD for 45 seconds (3 × 15 seconds) of audio. Fractions of seconds are included when rounding up to the nearest increment of 15 seconds. That is, 15.14 seconds are rounded up and billed as 30 seconds.
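The rounding rule can be checked with a short calculation. The sketch below is illustrative only: the function names are hypothetical, it uses the standard $0.006 rate by default, and it deliberately ignores the free first 60 minutes so that it reproduces the per-request arithmetic in the footnote.

```python
import math
from decimal import Decimal

INCREMENT_SECONDS = 15
PRICE_PER_INCREMENT = Decimal("0.006")  # USD per 15 seconds, standard (non-video) rate


def billed_seconds(duration_seconds):
    """Round one request's audio length up to the next 15-second increment."""
    return math.ceil(duration_seconds / INCREMENT_SECONDS) * INCREMENT_SECONDS


def cost_usd(request_durations, price_per_increment=PRICE_PER_INCREMENT):
    """Total cost for a list of requests; each request is rounded up separately."""
    increments = sum(billed_seconds(d) // INCREMENT_SECONDS for d in request_durations)
    return increments * price_per_increment


if __name__ == "__main__":
    print(billed_seconds(7))       # 15
    print(billed_seconds(15.14))   # 30
    print(cost_usd([7, 7, 7]))     # 0.018
```

Because rounding is applied per request, three 7-second requests are billed as 45 seconds, whereas a single 21-second request would be billed as 30 seconds.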