- Automatic Speech Recognition
  - Automatic Speech Recognition (ASR), built on deep learning neural networks, powers applications such as voice search and speech transcription.
- Global Vocabulary
  - Recognizes 120 languages and variants with an extensive vocabulary.
- Phrase Hints
  - Speech recognition can be customized to a specific context by providing a set of words and phrases that are likely to be spoken. This is especially useful for adding custom words and names to the vocabulary and in voice-control use cases.
- Real-time Streaming or Prerecorded Audio Support
  - Audio input can be streamed from an application’s microphone or sent from a prerecorded audio file (inline or through Google Cloud Storage). Multiple audio encodings are supported, including FLAC, AMR, PCMU, and Linear-16.
- Auto-Detect Language (BETA)
  - For multilingual scenarios, you can specify two to four language codes, and Cloud Speech-to-Text will identify the language spoken and provide the transcript.
- Noise Robustness
  - Handles noisy audio from many environments without requiring additional noise cancellation.
- Inappropriate Content Filtering
  - Filters inappropriate content in text results for some languages.
- Automatic Punctuation (BETA)
  - Uses machine learning to punctuate transcriptions accurately (e.g., with commas, question marks, and periods).
- Model Selection
  - Choose from four pre-built models: default, voice commands and search, phone calls, and video transcription.
- Speaker Diarization (BETA)
  - Know who said what: get automatic predictions about which speaker in a conversation spoke each utterance.
- Multichannel Recognition
  - In multiparticipant recordings where each participant is recorded in a separate channel (e.g., a phone call with two channels or a video conference with four channels), Cloud Speech-to-Text recognizes each channel separately and then annotates the transcripts so that they follow the order in which the words were spoken.
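Several of the features above are switched on per request. The following is a minimal sketch, not an official client, of how a request configuration combining phrase hints, language auto-detection, model selection, automatic punctuation, content filtering, and multichannel recognition might be assembled. The helper `build_recognition_config` is hypothetical, and the field names and model identifiers are assumptions based on the public REST `RecognitionConfig` resource rather than anything stated on this page, so check them against the current API reference before use. Speaker diarization is left out because its field name differs between API versions.

```python
import json
from typing import Sequence

# Assumed identifiers for the four pre-built models listed above.
MODELS = {"default", "command_and_search", "phone_call", "video"}


def build_recognition_config(
    language_codes: Sequence[str],
    phrase_hints: Sequence[str] = (),
    model: str = "default",
    encoding: str = "LINEAR16",
    sample_rate_hertz: int = 16000,
    channel_count: int = 1,
    punctuation: bool = False,
    profanity_filter: bool = False,
) -> dict:
    """Build a JSON-serializable config dict (hypothetical helper)."""
    # One code means a fixed language; two to four enable auto-detection.
    if not 1 <= len(language_codes) <= 4:
        raise ValueError("provide between one and four language codes")
    if model not in MODELS:
        raise ValueError(f"unknown model: {model!r}")
    if channel_count < 1:
        raise ValueError("channel_count must be at least 1")

    config = {
        "encoding": encoding,
        "sampleRateHertz": sample_rate_hertz,
        "languageCode": language_codes[0],
        "model": model,
    }
    if len(language_codes) > 1:
        # Auto-detect: the remaining codes are candidate languages.
        config["alternativeLanguageCodes"] = list(language_codes[1:])
    if phrase_hints:
        # Phrase hints: words and phrases likely to be spoken.
        config["speechContexts"] = [{"phrases": list(phrase_hints)}]
    if punctuation:
        config["enableAutomaticPunctuation"] = True
    if profanity_filter:
        config["profanityFilter"] = True
    if channel_count > 1:
        # Multichannel: recognize each channel separately.
        config["audioChannelCount"] = channel_count
        config["enableSeparateRecognitionPerChannel"] = True
    return config


if __name__ == "__main__":
    example = build_recognition_config(
        ["en-US", "es-ES"],
        phrase_hints=["Cloud Speech-to-Text"],
        model="phone_call",
        channel_count=2,
        punctuation=True,
    )
    print(json.dumps(example, indent=2))
```

Building the configuration as a plain dictionary keeps the sketch independent of any particular client library; the same structure could be passed as the `config` member of a JSON request body.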
|Feature|Standard models (all models except enhanced phone and video), 0-60 minutes|Standard models, over 60 minutes up to 1 million minutes|Premium models* (enhanced phone, video), 0-60 minutes|Premium models*, over 60 minutes up to 1 million minutes|
|---|---|---|---|---|
|Speech Recognition (without Data Logging - default)|Free|$0.006 / 15 seconds **|Free|$0.009 / 15 seconds **|
|Speech Recognition (with Data Logging opt-in)|Free|$0.004 / 15 seconds **|Free|$0.006 / 15 seconds **|
This pricing is for applications on personal systems (e.g., phones, tablets, laptops, desktops). Please contact us for approval and pricing to use the Speech-to-Text API on embedded devices (e.g., cars, TVs, appliances, or speakers).
* Currently available for US English only
** Each request is rounded up to the nearest increment of 15 seconds. For example, if you make three separate requests (Standard model), each containing 7 seconds of audio, you are billed $0.018 USD for 45 seconds (3 × 15 seconds) of audio. Fractions of seconds are included when rounding up to the nearest increment of 15 seconds. That is, 15.14 seconds are rounded up and billed as 30 seconds.
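The rounding rule can be sketched in a few lines of Python. This is an illustrative calculation only: the function names and the `tier`/`data_logging` keys are hypothetical, the prices are copied from the table above, the free 0-60 minute allowance is deliberately ignored, and treating a zero-length request as unbilled is an assumption.

```python
import math
from decimal import Decimal
from typing import Iterable

INCREMENT_SECONDS = 15

# USD per 15-second increment, keyed by (tier, data_logging); from the table above.
PRICE_PER_INCREMENT = {
    ("standard", False): Decimal("0.006"),
    ("standard", True): Decimal("0.004"),
    ("premium", False): Decimal("0.009"),
    ("premium", True): Decimal("0.006"),
}


def billed_seconds(duration_seconds: float) -> int:
    """Round one request's audio duration up to a multiple of 15 seconds."""
    if duration_seconds < 0:
        raise ValueError("duration must not be negative")
    # Assumption: a zero-length request is billed as 0 seconds.
    return math.ceil(duration_seconds / INCREMENT_SECONDS) * INCREMENT_SECONDS


def request_cost(
    durations: Iterable[float],
    tier: str = "standard",
    data_logging: bool = False,
) -> Decimal:
    """Total cost of several requests, each rounded up separately."""
    price = PRICE_PER_INCREMENT[(tier, data_logging)]
    increments = sum(billed_seconds(d) // INCREMENT_SECONDS for d in durations)
    return price * increments


if __name__ == "__main__":
    print(billed_seconds(15.14))    # 30
    print(request_cost([7, 7, 7]))  # 0.018
```

Because rounding applies per request, three 7-second requests cost three increments, whereas a single 21-second request would cost two.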
Features marked BETA on this page are in beta. For more information, see the documentation on product launch stages.