Speech-to-Text

Accurately convert speech into text using an API powered by Google’s AI technologies.

  • Transcribe your content with accurate captions

  • Deliver a better user experience in products through voice commands

  • Gain insights from customer interactions to improve your service

State-of-the-art accuracy

Apply Google’s most advanced deep learning neural network algorithms for automatic speech recognition (ASR).

Global reach

Meet your users where they are, globally, with voice recognition that supports more than 125 languages and variants.

Accelerated innovation

Combine with the best of Google’s technologies in Text-to-Speech and Natural Language to unlock use cases like voice bots and sentiment analysis for speech.

Put Speech-to-Text into action

Key features

Speech adaptation

Customize speech recognition to transcribe domain-specific terms and rare words by providing hints, and boost the transcription accuracy of specific words or phrases. Automatically convert spoken numbers into addresses, years, currencies, and more using classes.
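As a rough illustration of the hints, boost, and classes described above, the following sketch builds a recognition request body as a plain dictionary. The field names (`speechContexts`, `phrases`, `boost`) and the class token `$ADDRESSNUM` follow the REST JSON representation as commonly documented, but treat them as assumptions to check against the current API reference. The helper name, the bucket URI, and the boost value are hypothetical.

```python
import json


def build_adaptation_request(gcs_uri, phrases, boost=None, language_code="en-US"):
    """Build a JSON-serializable request body that carries phrase hints.

    Hypothetical helper: it only assembles the payload and does not call the API.
    """
    context = {"phrases": list(phrases)}
    if boost is not None:
        # Assumed field: a higher boost biases recognition toward these phrases.
        context["boost"] = boost
    return {
        "config": {
            "languageCode": language_code,
            "speechContexts": [context],
        },
        "audio": {"uri": gcs_uri},
    }


request = build_adaptation_request(
    "gs://example-bucket/order-call.flac",  # hypothetical Cloud Storage URI
    ["Castbox", "$ADDRESSNUM"],  # a rare word plus an assumed class token
    boost=10.0,  # illustrative value only
)
print(json.dumps(request, indent=2))
```

The payload is kept separate from any client library so it can be inspected or logged before it is sent.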

Domain-specific models

Choose from a selection of trained models for voice control, phone call transcription, and video transcription, each optimized for domain-specific quality requirements. For example, our enhanced phone call model is tuned for audio that originates from telephony, such as phone calls recorded at an 8 kHz sampling rate.
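A minimal sketch of how an application might pick one of these models follows. The model identifiers (`command_and_search`, `phone_call`, `video`, `default`) and the `useEnhanced` flag are assumed from the REST configuration as commonly documented; the mapping, the helper name, and the `LINEAR16` encoding are illustrative choices, not part of the page above.

```python
# Assumed model identifiers; verify against the supported models list.
MODEL_BY_USE_CASE = {
    "voice_control": "command_and_search",
    "phone_call": "phone_call",
    "video": "video",
}


def build_config(use_case, sample_rate_hertz, language_code="en-US", enhanced=False):
    """Return a recognition config dict for the given use case (hypothetical helper)."""
    config = {
        "languageCode": language_code,
        "sampleRateHertz": sample_rate_hertz,
        "encoding": "LINEAR16",  # assumption: uncompressed 16-bit PCM audio
        "model": MODEL_BY_USE_CASE.get(use_case, "default"),
    }
    if enhanced:
        config["useEnhanced"] = True
    return config


# Telephony audio recorded at 8 kHz, using the enhanced phone call model.
phone_config = build_config("phone_call", 8000, enhanced=True)
print(phone_config)
```

Unknown use cases fall back to the `default` model rather than raising, which keeps the sketch forgiving for exploratory use.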

Streaming speech recognition

Receive real-time speech recognition results as the API processes the audio input streamed from your application’s microphone or sent from a prerecorded audio file (inline or through Cloud Storage).
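The streaming flow can be pictured as one configuration message followed by a series of audio chunks. The sketch below models that sequence with plain dictionaries as stand-ins for the real request messages; the keys `streaming_config`, `interim_results`, and `audio_content` mirror the client library's message fields as I understand them, and the chunk size is an arbitrary choice. No network call is made.

```python
def streaming_requests(audio_bytes, config, chunk_size=3200):
    """Yield a config message first, then the audio split into fixed-size chunks.

    Plain dicts stand in for the real streaming request messages (assumption).
    """
    yield {"streaming_config": {"config": config, "interim_results": True}}
    for start in range(0, len(audio_bytes), chunk_size):
        yield {"audio_content": audio_bytes[start:start + chunk_size]}


# One second of silent 16 kHz, 16-bit mono audio is 32,000 bytes.
silence = bytes(32000)
config = {"encoding": "LINEAR16", "sampleRateHertz": 16000, "languageCode": "en-US"}

messages = list(streaming_requests(silence, config))
print(len(messages))  # 11: one config message plus ten 100 ms audio chunks
```

In a live application the byte string would be replaced by reads from a microphone buffer, with each chunk sent as soon as it is captured.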


Customers

Castbox uses Speech-to-Text to deliver its in-audio search service for podcasts.

Story highlights

  • Enabling users to search audio content for words or phrases

  • Audio-to-text conversion accuracy rates of greater than 96%

  • Typical search queries with latency of just 50 milliseconds

Industry

  • Technology

Documentation

Google Cloud Basics
Speech-to-Text basics

Learn the fundamental concepts in Speech-to-Text.

Quickstart
Quickstart: Using the gcloud tool

Send an audio transcription request to Speech-to-Text using the gcloud tool from the command line.

Best Practice
Best practices

Review the best practices for transcribing audio with Speech-to-Text.

Tutorial
ML onramp

Explore Speech-to-Text tutorials, codelabs, and more.

Google Cloud Basics
Supported languages

Learn which languages are available for Speech-to-Text, plus the features and recognition models available for each.

Use cases

Use case
Improve customer service

Empower your customer service system by adding IVR (interactive voice response) and agent conversations to your call centers. Perform analytics on your conversation data to gain more insights into the calls and your customers. Speech-to-Text and its enhanced phone call models already power Google Cloud’s Contact Center AI solution.

[Figure: Using Contact Center AI with Speech-to-Text to improve customer service]

Use case
Enable voice control

Implement voice commands such as “Turn the volume up” and voice search such as “What is the temperature in Paris?” Combine this with the Text-to-Speech API to deliver voice-enabled experiences in IoT (Internet of Things) applications.

[Figure: Workflow of voice control using the Speech-to-Text API]

Use case
Transcribe multimedia content

Transcribe your audio and video to include captions and improve your audience reach and experience. Add subtitles to your streaming content in real time. Our video transcription model is ideal for indexing or subtitling video and multispeaker content, and uses machine learning technology similar to video captioning on YouTube.

[Figure: Transcribe multimedia content workflow]

All features

Global vocabulary

Support your global user base with Speech-to-Text’s extensive language support, covering more than 125 languages and variants.

Streaming speech recognition

Receive real-time speech recognition results as the API processes the audio input streamed from your application’s microphone or sent from a prerecorded audio file (inline or through Cloud Storage).

Speech adaptation

Customize speech recognition to transcribe domain-specific terms and rare words by providing hints, and boost the transcription accuracy of specific words or phrases. Automatically convert spoken numbers into addresses, years, currencies, and more using classes.

Multichannel recognition

Speech-to-Text can recognize distinct channels in multichannel situations (such as a video conference) and annotate the transcripts to preserve the order.

Noise robustness

Speech-to-Text can handle noisy audio from many environments without requiring additional noise cancellation.

Domain-specific models

Choose from a selection of trained models for voice control, phone call transcription, and video transcription, each optimized for domain-specific quality requirements. For example, our enhanced phone call model is tuned for audio that originates from telephony, such as phone calls recorded at an 8 kHz sampling rate.

Content filtering

The profanity filter helps you detect inappropriate or unprofessional content in your audio data and filter out profane words in text results.

Auto-detect language (beta)

Specify up to four language codes and Speech-to-Text will identify the correct language spoken in multilingual scenarios.

Automatic punctuation (beta)

Speech-to-Text accurately punctuates transcriptions (for example, with commas, question marks, and periods).

Speaker diarization (beta)

Know who said what by receiving automatic predictions about which of the speakers in a conversation spoke each utterance.
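
Several of the features above are switched on through fields of the recognition configuration. The sketch below assembles such a configuration and enforces the four-language limit mentioned for language auto-detection. The field names (`alternativeLanguageCodes`, `enableAutomaticPunctuation`, `diarizationConfig`, `profanityFilter`) are assumed from the REST representation as commonly documented and may differ between API versions; the helper itself is hypothetical.

```python
def build_feature_config(language_codes, punctuation=True, diarization=False,
                         profanity_filter=False):
    """Build a config dict enabling optional features (hypothetical helper)."""
    if not 1 <= len(language_codes) <= 4:
        raise ValueError("expected between one and four language codes")
    # The first code is treated as primary; the rest are detection candidates.
    primary, *alternatives = language_codes
    config = {"languageCode": primary}
    if alternatives:
        config["alternativeLanguageCodes"] = alternatives
    if punctuation:
        config["enableAutomaticPunctuation"] = True
    if diarization:
        config["diarizationConfig"] = {"enableSpeakerDiarization": True}
    if profanity_filter:
        config["profanityFilter"] = True
    return config


feature_config = build_feature_config(["en-US", "es-ES", "fr-FR"], diarization=True)
print(feature_config)
```

Rejecting more than four codes locally surfaces the mistake before any request is sent.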

Pricing

Speech-to-Text is priced per 15 seconds of audio processed after a 60-minute free tier.
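
To make the 15-second increment and the free tier concrete, here is a small sketch that counts billable increments for a set of requests. It rests on two assumptions not stated above: that each request is rounded up to the next 15-second increment, and that the 60 free minutes are subtracted from the total within one billing period. No price is given here, so the function returns a count of increments rather than an amount.

```python
import math

INCREMENT_SECONDS = 15
FREE_TIER_SECONDS = 60 * 60  # the 60-minute free tier


def billable_increments(request_durations_seconds):
    """Count paid 15-second increments after the free tier is used up.

    Assumption: each request is rounded up to a whole increment on its own.
    """
    billed_seconds = sum(
        math.ceil(duration / INCREMENT_SECONDS) * INCREMENT_SECONDS
        for duration in request_durations_seconds
    )
    paid_seconds = max(0, billed_seconds - FREE_TIER_SECONDS)
    return paid_seconds // INCREMENT_SECONDS


print(billable_increments([16]))        # 0: still inside the free tier
print(billable_increments([3600, 16]))  # 2: the 16 s request rounds up to 30 s
```

Multiplying the returned count by the current per-increment price from the pricing page gives an estimated charge.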