This page describes how to use a specific machine learning model for audio transcription requests to Speech-to-Text.
Transcription models
Speech-to-Text detects words in an audio clip by comparing input to one of many machine learning models. Each model has been trained by analyzing millions of examples, in this case a very large number of audio recordings of people speaking.
Speech-to-Text has specialized models trained on audio from specific sources, for example phone calls or videos. Because of this training process, these specialized models provide better results when applied to similar kinds of audio data.
For example, Speech-to-Text has a transcription model trained to recognize speech recorded over the phone. When Speech-to-Text uses the telephony or telephony_short model to transcribe phone audio, it produces more accurate transcription results than if it had transcribed phone audio using the latest_short or latest_long models.
The following table shows the transcription models available for use with Speech-to-Text.
Model name | Description |
---|---|
latest_long | Use this model for any kind of long form content such as media or spontaneous speech and conversations. Consider using this model in place of the video model, especially if the video model is not available in your target language. You can also use this in place of the default model. |
latest_short | Use this model for short utterances that are a few seconds in length. It is useful for trying to capture commands or other single shot directed speech use cases. Consider using this model instead of the command and search model. |
telephony | Improved version of the "phone_call" model, best for audio that originated from a phone call, typically recorded at an 8 kHz sampling rate. |
telephony_short | Dedicated version of the modern "telephony" model for short or even single-word utterances for audio that originated from a phone call, typically recorded at an 8 kHz sampling rate. |
medical_dictation | Use this model to transcribe notes dictated by a medical professional. This is a premium model that costs more than the standard rate. See the pricing page for more details. |
medical_conversation | Use this model to transcribe a conversation between a medical professional and a patient. This is a premium model that costs more than the standard rate. See the pricing page for more details. |
The following models are mostly based on classic non-conformer architectures and are primarily kept for legacy and backwards-compatibility reasons. | |
command_and_search | Best for short or single-word utterances like voice commands or voice search. |
default | Best for audio that does not fit the other audio models, like long-form audio or dictation. The default model will produce transcription results for any type of audio, including audio such as video clips that have a separate model specifically tailored to it. However, recognizing video clip audio using the default model will likely yield lower-quality results than using the video model. Ideally the audio is high-fidelity, recorded at a 16 kHz or greater sampling rate. |
phone_call | Best for audio that originated from a phone call (typically recorded at an 8 kHz sampling rate). |
video | Best for audio from video clips or other sources (such as podcasts) that have multiple speakers. This model is also often the best choice for audio that was recorded with a high-quality microphone or that has lots of background noise. For best results, provide audio recorded at a 16 kHz or greater sampling rate. |
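The guidance in the table can be condensed into a small lookup. The following is only an illustrative sketch: the function name `choose_model` and its two boolean inputs are hypothetical simplifications, and it covers only the four non-legacy, non-medical models.

```python
def choose_model(phone_audio: bool, short_utterance: bool) -> str:
    """Pick a model name from the table using two coarse facts about the audio.

    Hypothetical helper for illustration; real selection may also depend on
    language availability and pricing, which this sketch ignores.
    """
    if phone_audio:
        # Phone-call audio: use the telephony family.
        return "telephony_short" if short_utterance else "telephony"
    # Everything else: use the latest_* family.
    return "latest_short" if short_utterance else "latest_long"


print(choose_model(phone_audio=True, short_utterance=False))
print(choose_model(phone_audio=False, short_utterance=True))
```

The legacy models (command_and_search, default, phone_call, video) and the medical models are left out on purpose; add branches for them if your audio calls for one.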
Select a model for audio transcription
To select a specific model for audio transcription, set the model field in the RecognitionConfig parameters for the request to one of the allowed values, such as latest_long, latest_short, telephony, or telephony_short.
Speech-to-Text supports model selection for all speech recognition methods: speech:recognize, speech:longrunningrecognize, and Streaming.
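As a minimal sketch of setting the model field, the snippet below builds the config portion of a REST request body as a plain dictionary. The JSON field names follow the curl example on this page. The helper name `build_config`, the `KNOWN_MODELS` set (copied from the table above), and the default values are assumptions made for illustration.

```python
import json

# Model names copied from the table on this page (assumed complete for this sketch).
KNOWN_MODELS = {
    "latest_long", "latest_short", "telephony", "telephony_short",
    "medical_dictation", "medical_conversation",
    "command_and_search", "default", "phone_call", "video",
}


def build_config(model, language_code="en-US", encoding="LINEAR16",
                 sample_rate_hertz=16000):
    """Return a RecognitionConfig-shaped dict with the model field set."""
    if model not in KNOWN_MODELS:
        raise ValueError(f"unknown model: {model!r}")
    return {
        "encoding": encoding,
        "sampleRateHertz": sample_rate_hertz,
        "languageCode": language_code,
        "model": model,
    }


# Phone audio is typically sampled at 8 kHz, per the table above.
config = build_config("telephony", sample_rate_hertz=8000)
print(json.dumps({"config": config}, indent=2))
```

Validating the name locally is a convenience only; the service remains the authority on which models are accepted for a given language.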
Perform transcription of a local audio file
Protocol
Refer to the speech:recognize API endpoint for complete details.
To perform synchronous speech recognition, make a POST request and provide the appropriate request body. The following shows an example of a POST request using curl. The example uses the Google Cloud CLI to generate an access token. For instructions on installing the gcloud CLI, see the quickstart.
curl -s -H "Content-Type: application/json" \
    -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
    https://speech.googleapis.com/v1/speech:recognize \
    --data '{
    "config": {
        "encoding": "LINEAR16",
        "sampleRateHertz": 16000,
        "languageCode": "en-US",
        "model": "video"
    },
    "audio": {
        "uri": "gs://cloud-samples-tests/speech/Google_Gnome.wav"
    }
}'
See the RecognitionConfig reference documentation for more information on configuring the request body.
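The curl example reads its audio from Cloud Storage. For a file on local disk, the REST request carries the audio bytes base64-encoded in the audio.content field instead of audio.uri. The sketch below builds such a body. The helper name `build_local_file_request` and the fixed encoding, sample rate, and language values are illustrative assumptions that must match your actual audio.

```python
import base64
import json


def build_local_file_request(path, model="latest_long"):
    """Build a speech:recognize request body for a local audio file.

    Assumes 16 kHz LINEAR16 English audio; adjust the config to your file.
    """
    with open(path, "rb") as audio_file:
        audio_bytes = audio_file.read()
    return {
        "config": {
            "encoding": "LINEAR16",
            "sampleRateHertz": 16000,
            "languageCode": "en-US",
            "model": model,
        },
        # JSON cannot carry raw bytes, so the audio is sent as base64 text.
        "audio": {"content": base64.b64encode(audio_bytes).decode("ascii")},
    }


def to_json(body):
    """Serialize the body for use as the POST payload."""
    return json.dumps(body)
```

The resulting JSON string can be passed to curl with --data in place of the inline body shown above.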
If the request is successful, the server returns a 200 OK HTTP status code and the response in JSON format:
{
  "results": [
    {
      "alternatives": [
        {
          "transcript": "OK Google stream stranger things from Netflix to my TV okay stranger things from Netflix playing on TV from the people that brought you Google home comes the next evolution of the smart home and it's just outside your window me Google know hi how can I help okay no what's the weather like outside the weather outside is sunny and 76 degrees he's right okay no turn on the hose I'm holding sure okay no I'm can I eat this lemon tree leaf yes what about this Daisy yes but I wouldn't recommend it but I could eat it okay Nomad milk to my shopping list I'm sorry that sounds like an indoor request I keep doing that sorry you do keep doing that okay no is this compost really we're all compost if you think about it pretty much everything is made up of organic matter and will return",
          "confidence": 0.9251011
        }
      ]
    }
  ]
}
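A response of this shape can be reduced to a list of transcripts with a few lines of code. The following sketch takes the first alternative of each result. The function name `extract_transcripts` is hypothetical, and the sample string is a made-up, shortened response used only to exercise the code.

```python
import json


def extract_transcripts(response_text):
    """Return (transcript, confidence) for the first alternative of each result."""
    response = json.loads(response_text)
    pairs = []
    # A response with no recognized speech may omit "results" entirely.
    for result in response.get("results", []):
        alternatives = result.get("alternatives", [])
        if alternatives:
            top = alternatives[0]
            pairs.append((top.get("transcript", ""), top.get("confidence")))
    return pairs


# Made-up sample with the same structure as the response above.
sample = ('{"results": [{"alternatives": '
          '[{"transcript": "hello world", "confidence": 0.92}]}]}')
for transcript, confidence in extract_transcripts(sample):
    print(f"{confidence}: {transcript}")
```

Using `.get` with defaults keeps the helper from raising when a field is absent, which is a defensive choice for this sketch rather than a statement about which fields the service always returns.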
Go
To learn how to install and use the client library for Speech-to-Text, see Speech-to-Text client libraries. For more information, see the Speech-to-Text Go API reference documentation.
To authenticate to Speech-to-Text, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
Java
To learn how to install and use the client library for Speech-to-Text, see Speech-to-Text client libraries. For more information, see the Speech-to-Text Java API reference documentation.
To authenticate to Speech-to-Text, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
Node.js
To learn how to install and use the client library for Speech-to-Text, see Speech-to-Text client libraries. For more information, see the Speech-to-Text Node.js API reference documentation.
To authenticate to Speech-to-Text, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
Python
To learn how to install and use the client library for Speech-to-Text, see Speech-to-Text client libraries. For more information, see the Speech-to-Text Python API reference documentation.
To authenticate to Speech-to-Text, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
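As a rough sketch of what a client-library call with an explicit model might look like in Python, the function below reads a local file and requests synchronous recognition. It assumes the google-cloud-speech package is installed and Application Default Credentials are configured. The function name, the file path argument, and the encoding, sample rate, and language values are illustrative assumptions, not the official sample.

```python
def transcribe_local_file(path, model="latest_long"):
    """Transcribe a local audio file with a chosen model (illustrative sketch)."""
    # Imported lazily so this module loads even without the package installed.
    from google.cloud import speech

    client = speech.SpeechClient()

    with open(path, "rb") as audio_file:
        content = audio_file.read()

    audio = speech.RecognitionAudio(content=content)
    config = speech.RecognitionConfig(
        # Assumed audio properties; these must match the actual file.
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16000,
        language_code="en-US",
        model=model,  # one of the model names from the table above
    )

    response = client.recognize(config=config, audio=audio)
    # Keep the first alternative of each result.
    return [result.alternatives[0].transcript for result in response.results]
```

Calling `transcribe_local_file("path/to/audio.raw", model="telephony")` would send the request; the path here is a placeholder.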
Additional languages
C#: Please follow the C# setup instructions on the client libraries page and then visit the Speech-to-Text reference documentation for .NET.
PHP: Please follow the PHP setup instructions on the client libraries page and then visit the Speech-to-Text reference documentation for PHP.
Ruby: Please follow the Ruby setup instructions on the client libraries page and then visit the Speech-to-Text reference documentation for Ruby.
Perform transcription of a Cloud Storage audio file
Go, Java, Node.js
To learn how to install and use the client library for Speech-to-Text, see Speech-to-Text client libraries. For more information, see the Speech-to-Text API reference documentation for your language.
To authenticate to Speech-to-Text, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
Additional languages
C#, PHP, Ruby: Follow the setup instructions for your language on the client libraries page, then see the corresponding Speech-to-Text reference documentation.
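For audio stored in Cloud Storage, the request body references the file by URI rather than embedding the bytes, as in the curl example earlier on this page. The sketch below builds that body. The helper name `build_gcs_request` and the gs:// prefix check are illustrative assumptions, and the config values mirror the curl example.

```python
import json


def build_gcs_request(gcs_uri, model="latest_long"):
    """Build a recognize request body that points at a Cloud Storage object."""
    if not gcs_uri.startswith("gs://"):
        raise ValueError(f"expected a gs:// URI, got {gcs_uri!r}")
    return {
        "config": {
            "encoding": "LINEAR16",
            "sampleRateHertz": 16000,
            "languageCode": "en-US",
            "model": model,
        },
        "audio": {"uri": gcs_uri},
    }


# Same object and model as the curl example on this page.
body = build_gcs_request(
    "gs://cloud-samples-tests/speech/Google_Gnome.wav", model="video"
)
print(json.dumps(body, indent=2))
```

The local check only guards against passing a filesystem path by mistake; it does not verify that the object exists or is readable.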