This page describes how to include additional details about the source audio in a speech recognition request to Speech-to-Text.
Speech-to-Text has several machine learning models to use for converting recorded audio into text. Each of these models has been trained based upon specific characteristics of audio input, including the type of audio file, the original recording device, the distance of the speaker from the recording device, the number of speakers on the audio file, and other factors.
When you send a transcription request to Speech-to-Text, you can include these additional details about the audio data as recognition metadata. Speech-to-Text can use these details to transcribe your audio data more accurately.
Google also analyzes and aggregates the most common use cases for Speech-to-Text by collecting this metadata. Google can then prioritize the most prominent use cases for improvements to Speech-to-Text.
Available metadata fields
You can provide any of the fields in the following table in the metadata of a transcription request.

| Field | Type | Description |
|---|---|---|
| interactionType | ENUM | The use case of the audio. |
| industryNaicsCodeOfAudio | number | The industry vertical of the audio file, as a 6-digit NAICS code. |
| microphoneDistance | ENUM | The distance of the microphone from the speaker. |
| originalMediaType | ENUM | The original media of the audio, either audio or video. |
| recordingDeviceType | ENUM | The kind of device used to capture the audio, including smartphones, PC microphones, vehicles, etc. |
| recordingDeviceName | string | The device used to make the recording. This arbitrary string can include names like 'Pixel XL', 'VoIP', 'Cardioid Microphone', or another value. |
| originalMimeType | string | The MIME type of the original audio file. Examples include audio/m4a, audio/x-alaw-basic, audio/mp3, audio/3gpp, or another audio file MIME type. |
| obfuscatedId | string | The privacy-protected ID of the user, to identify the number of unique users using the service. |
| audioTopic | string | An arbitrary description of the subject matter discussed in the audio file. Examples include "Guided tour of New York City," "court trial hearing," or "live interview between 2 people." |
See the RecognitionMetadata reference documentation for more information about these fields.
Use recognition metadata
To add recognition metadata to a speech recognition request to the Speech-to-Text API, set the metadata field of the speech recognition request to a RecognitionMetadata object.

The Speech-to-Text API supports recognition metadata for all speech recognition methods: speech:recognize, speech:longrunningrecognize, and streaming recognition. See the RecognitionMetadata reference documentation for more information on the types of metadata that you can include with your request.
The following code demonstrates how to specify additional metadata fields in a transcription request.
Protocol
Refer to the speech:recognize API endpoint for complete details.
To perform synchronous speech recognition, make a POST request and provide the appropriate request body. The following shows an example of a POST request using curl. The example uses the access token for a service account set up for the project using the Google Cloud SDK. For instructions on installing the Cloud SDK, setting up a project with a service account, and obtaining an access token, see the quickstart.
```sh
curl -s -H "Content-Type: application/json" \
    -H "Authorization: Bearer "$(gcloud auth print-access-token) \
    https://speech.googleapis.com/v1/speech:recognize \
    --data '{
      "config": {
        "encoding": "FLAC",
        "sampleRateHertz": 16000,
        "languageCode": "en-US",
        "enableWordTimeOffsets": false,
        "metadata": {
          "interactionType": "VOICE_SEARCH",
          "industryNaicsCodeOfAudio": 23810,
          "microphoneDistance": "NEARFIELD",
          "originalMediaType": "AUDIO",
          "recordingDeviceType": "OTHER_INDOOR_DEVICE",
          "recordingDeviceName": "Polycom SoundStation IP 6000",
          "originalMimeType": "audio/mp3",
          "obfuscatedId": "11235813",
          "audioTopic": "questions about landmarks in NYC"
        }
      },
      "audio": {
        "uri": "gs://cloud-samples-tests/speech/brooklyn.flac"
      }
    }'
```
See the RecognitionConfig reference documentation for more information on configuring the request body.
If the request is successful, the server returns a 200 OK HTTP status code and the response in JSON format:
{ "results": [ { "alternatives": [ { "transcript": "how old is the Brooklyn Bridge", "confidence": 0.98360395 } ] } ] }
Python
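The following is a minimal sketch of the same request using the google-cloud-speech Python client library. It assumes the v1 RecognitionMetadata type and its nested enums as exposed by that library; check the client library reference for the exact surface of the version you have installed.

```python
# Minimal sketch: synchronous recognition with recognition metadata,
# assuming the google-cloud-speech (v1) Python client. Verify class
# and enum names against your installed library version.
from google.cloud import speech

client = speech.SpeechClient()

# Describe the source audio; values mirror the curl example above.
metadata = speech.RecognitionMetadata(
    interaction_type=speech.RecognitionMetadata.InteractionType.VOICE_SEARCH,
    industry_naics_code_of_audio=23810,
    microphone_distance=speech.RecognitionMetadata.MicrophoneDistance.NEARFIELD,
    original_media_type=speech.RecognitionMetadata.OriginalMediaType.AUDIO,
    recording_device_type=speech.RecognitionMetadata.RecordingDeviceType.OTHER_INDOOR_DEVICE,
    recording_device_name="Polycom SoundStation IP 6000",
    original_mime_type="audio/mp3",
    audio_topic="questions about landmarks in NYC",
)

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.FLAC,
    sample_rate_hertz=16000,
    language_code="en-US",
    metadata=metadata,
)

audio = speech.RecognitionAudio(uri="gs://cloud-samples-tests/speech/brooklyn.flac")

# Synchronous recognition; use long_running_recognize for audio
# longer than about one minute.
response = client.recognize(config=config, audio=audio)
for result in response.results:
    print(result.alternatives[0].transcript)
```

For streaming recognition, the same RecognitionConfig, including its metadata field, is wrapped in a StreamingRecognitionConfig, so the metadata fields carry over unchanged.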