Adding recognition metadata

This page describes how to include additional details about the source audio in a speech recognition request to Speech-to-Text.

Speech-to-Text has several machine learning models to use for converting recorded audio into text. Each of these models has been trained based upon specific characteristics of audio input, including the type of audio file, the original recording device, the distance of the speaker from the recording device, the number of speakers on the audio file, and other factors.

When you send a transcription request to Speech-to-Text, you can include these additional details about the audio data as recognition metadata. Speech-to-Text can use these details to more accurately transcribe your audio data.

Google also collects this metadata to analyze and aggregate the most common use cases for Speech-to-Text. Google can then prioritize the most prominent use cases for improvements to Speech-to-Text.

Available metadata fields

You can provide any of the fields in the following list in the metadata of a transcription request.

- interactionType (ENUM): The use case of the audio.
- industryNaicsCodeOfAudio (number): The industry vertical of the audio file, as a 6-digit NAICS code.
- microphoneDistance (ENUM): The distance of the microphone from the speaker.
- originalMediaType (ENUM): The original media of the audio, either audio or video.
- recordingDeviceType (ENUM): The kind of device used to capture the audio, such as a smartphone, PC microphone, or vehicle.
- recordingDeviceName (string): The device used to make the recording. This arbitrary string can include names like 'Pixel XL', 'VoIP', 'Cardioid Microphone', or another value.
- originalMimeType (string): The MIME type of the original audio file. Examples include audio/m4a, audio/x-alaw-basic, audio/mp3, and audio/3gpp.
- obfuscatedId (string): The privacy-protected ID of the user, used to identify the number of unique users of the service.
- audioTopic (string): An arbitrary description of the subject matter discussed in the audio file. Examples include "Guided tour of New York City," "court trial hearing," or "live interview between 2 people."

See the RecognitionMetadata reference documentation for more information about these fields.
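The obfuscatedId field should carry a privacy-protected identifier, not a raw user ID. One possible approach, sketched below with only the Python standard library (the salt name and truncation length are illustrative assumptions, not part of the Speech-to-Text API), is to derive a stable one-way keyed hash per user:

```python
import hashlib
import hmac

def obfuscated_id(user_id: str, salt: str) -> str:
    """Derive a stable, privacy-protected ID from an internal user ID.

    The salt is assumed to be a project-specific secret, so the hash
    cannot be reversed by anyone who only knows the user ID space.
    """
    digest = hmac.new(
        salt.encode("utf-8"), user_id.encode("utf-8"), hashlib.sha256
    ).hexdigest()
    # Truncate for readability; collisions remain vanishingly unlikely
    # when the goal is only counting unique users.
    return digest[:16]

print(obfuscated_id("user-42", "my-project-secret"))
```

Because the hash is deterministic, the same user always maps to the same obfuscatedId across requests, which is what lets the service count unique users without receiving the real identifier.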

Use recognition metadata

To add recognition metadata to a speech recognition request to the Speech-to-Text API, set the metadata field of the speech recognition request to a RecognitionMetadata object. The Speech-to-Text API supports recognition metadata for all speech recognition methods: speech:recognize, speech:longrunningrecognize, and streaming recognition. See the RecognitionMetadata reference documentation for more information on the types of metadata that you can include with your request.
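For the REST API, the RecognitionMetadata object nests inside the config object of the request body, alongside fields such as languageCode. As a sketch, using a plain Python dict to mirror the JSON body (the field values and the audio URI here are illustrative, not required values):

```python
import json

# Sketch of a speech:recognize request body carrying recognition metadata.
# The "metadata" object sits inside "config", next to the usual
# recognition settings.
request_body = {
    "config": {
        "encoding": "FLAC",
        "sampleRateHertz": 16000,
        "languageCode": "en-US",
        "metadata": {
            "interactionType": "VOICE_SEARCH",
            "microphoneDistance": "NEARFIELD",
            "recordingDeviceType": "SMARTPHONE",
        },
    },
    # Hypothetical Cloud Storage URI for the audio to transcribe.
    "audio": {"uri": "gs://my-bucket/my-audio.flac"},
}

print(json.dumps(request_body, indent=2))
```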

The following code samples demonstrate how to specify additional metadata fields in a transcription request.


Refer to the speech:recognize API endpoint for complete details.

To perform synchronous speech recognition, make a POST request and provide the appropriate request body. The following shows an example of a POST request using curl. The example uses the access token for a service account set up for the project using the Google Cloud SDK. For instructions on installing the Cloud SDK, setting up a project with a service account, and obtaining an access token, see the quickstart.

curl -s -H "Content-Type: application/json" \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    --data '{
    "config": {
        "encoding": "FLAC",
        "sampleRateHertz": 16000,
        "languageCode": "en-US",
        "enableWordTimeOffsets": false,
        "metadata": {
            "interactionType": "VOICE_SEARCH",
            "industryNaicsCodeOfAudio": 23810,
            "microphoneDistance": "NEARFIELD",
            "originalMediaType": "AUDIO",
            "recordingDeviceType": "OTHER_INDOOR_DEVICE",
            "recordingDeviceName": "Polycom SoundStation IP 6000",
            "originalMimeType": "audio/mp3",
            "obfuscatedId": "11235813",
            "audioTopic": "questions about landmarks in NYC"
        }
    },
    "audio": {
        "uri": "gs://cloud-samples-tests/speech/brooklyn.flac"
    }
}' "https://speech.googleapis.com/v1p1beta1/speech:recognize"

See the RecognitionConfig reference documentation for more information on configuring the request body.

If the request is successful, the server returns a 200 OK HTTP status code and the response in JSON format:

  "results": [
      "alternatives": [
          "transcript": "how old is the Brooklyn Bridge",
          "confidence": 0.98360395


/**
 * Please include the following imports to run this sample.
 */
import com.google.cloud.speech.v1p1beta1.RecognitionAudio;
import com.google.cloud.speech.v1p1beta1.RecognitionConfig;
import com.google.cloud.speech.v1p1beta1.RecognitionMetadata;
import com.google.cloud.speech.v1p1beta1.RecognizeRequest;
import com.google.cloud.speech.v1p1beta1.RecognizeResponse;
import com.google.cloud.speech.v1p1beta1.SpeechClient;
import com.google.cloud.speech.v1p1beta1.SpeechRecognitionAlternative;
import com.google.cloud.speech.v1p1beta1.SpeechRecognitionResult;
import com.google.protobuf.ByteString;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public static void sampleRecognize() {
  // TODO(developer): Replace these variables before running the sample.
  String localFilePath = "resources/commercial_mono.wav";
  sampleRecognize(localFilePath);
}

/**
 * Adds additional details about the short audio file included in this recognition request.
 *
 * @param localFilePath Path to local audio file, e.g. /path/audio.wav
 */
public static void sampleRecognize(String localFilePath) {
  try (SpeechClient speechClient = SpeechClient.create()) {

    // The use case of the audio, e.g. PHONE_CALL, DISCUSSION, PRESENTATION, et al.
    RecognitionMetadata.InteractionType interactionType =
        RecognitionMetadata.InteractionType.VOICE_SEARCH;

    // The kind of device used to capture the audio
    RecognitionMetadata.RecordingDeviceType recordingDeviceType =
        RecognitionMetadata.RecordingDeviceType.SMARTPHONE;

    // The device used to make the recording.
    // Arbitrary string, e.g. 'Pixel XL', 'VoIP', 'Cardioid Microphone', or other value.
    String recordingDeviceName = "Pixel 3";
    RecognitionMetadata metadata =
        RecognitionMetadata.newBuilder()
            .setInteractionType(interactionType)
            .setRecordingDeviceType(recordingDeviceType)
            .setRecordingDeviceName(recordingDeviceName)
            .build();

    // The language of the supplied audio. Even though additional languages are
    // provided by alternative_language_codes, a primary language is still required.
    String languageCode = "en-US";
    RecognitionConfig config =
        RecognitionConfig.newBuilder()
            .setMetadata(metadata)
            .setLanguageCode(languageCode)
            .build();
    Path path = Paths.get(localFilePath);
    byte[] data = Files.readAllBytes(path);
    ByteString content = ByteString.copyFrom(data);
    RecognitionAudio audio = RecognitionAudio.newBuilder().setContent(content).build();
    RecognizeRequest request =
        RecognizeRequest.newBuilder().setConfig(config).setAudio(audio).build();
    RecognizeResponse response = speechClient.recognize(request);
    for (SpeechRecognitionResult result : response.getResultsList()) {
      // First alternative is the most probable result
      SpeechRecognitionAlternative alternative = result.getAlternativesList().get(0);
      System.out.printf("Transcript: %s\n", alternative.getTranscript());
    }
  } catch (Exception exception) {
    System.err.println("Failed to create the client due to: " + exception);
  }
}


// Imports the Google Cloud client library for Beta API
/**
 * TODO(developer): Update client library import to use new
 * version of API when desired features become available
 */
const speech = require('@google-cloud/speech').v1p1beta1;
const fs = require('fs');

// Creates a client
const client = new speech.SpeechClient();

async function syncRecognizeWithMetaData() {
  /**
   * TODO(developer): Uncomment the following lines before running the sample.
   */
  // const filename = 'Local path to audio file, e.g. /path/to/audio.raw';
  // const encoding = 'Encoding of the audio file, e.g. LINEAR16';
  // const sampleRateHertz = 16000;
  // const languageCode = 'BCP-47 language code, e.g. en-US';

  const recognitionMetadata = {
    interactionType: 'DISCUSSION',
    microphoneDistance: 'NEARFIELD',
    recordingDeviceType: 'SMARTPHONE',
    recordingDeviceName: 'Pixel 2 XL',
    industryNaicsCodeOfAudio: 519190,
  };

  const config = {
    encoding: encoding,
    sampleRateHertz: sampleRateHertz,
    languageCode: languageCode,
    metadata: recognitionMetadata,
  };

  const audio = {
    content: fs.readFileSync(filename).toString('base64'),
  };

  const request = {
    config: config,
    audio: audio,
  };

  // Detects speech in the audio file
  const [response] = await client.recognize(request);
  response.results.forEach(result => {
    const alternative = result.alternatives[0];
    console.log(`Transcription: ${alternative.transcript}`);
  });
}

syncRecognizeWithMetaData();


import io

from google.cloud import speech_v1p1beta1 as speech

client = speech.SpeechClient()

speech_file = "resources/commercial_mono.wav"

with io.open(speech_file, "rb") as audio_file:
    content = audio_file.read()

# Here we construct a recognition metadata object.
# Most metadata fields are specified as enums that can be found
# in speech.enums.RecognitionMetadata
metadata = speech.RecognitionMetadata()
metadata.interaction_type = speech.RecognitionMetadata.InteractionType.DISCUSSION
metadata.microphone_distance = (
    speech.RecognitionMetadata.MicrophoneDistance.NEARFIELD
)
metadata.recording_device_type = (
    speech.RecognitionMetadata.RecordingDeviceType.SMARTPHONE
)

# Some metadata fields are free form strings
metadata.recording_device_name = "Pixel 2 XL"
# And some are integers, for instance the 6 digit NAICS code
metadata.industry_naics_code_of_audio = 519190

audio = speech.RecognitionAudio(content=content)
config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=8000,
    language_code="en-US",
    # Add this in the request to send metadata.
    metadata=metadata,
)
response = client.recognize(config=config, audio=audio)

for i, result in enumerate(response.results):
    alternative = result.alternatives[0]
    print("-" * 20)
    print(u"First alternative of result {}".format(i))
    print(u"Transcript: {}".format(alternative.transcript))