Adding Recognition Metadata

This page describes how to add details about the source audio to a speech recognition request sent to Cloud Speech-to-Text.

Cloud Speech-to-Text offers several machine learning models for converting recorded audio into text. Each of these models has been trained on audio input with specific characteristics, including the type of audio file, the original recording device, the distance of the speaker from the recording device, the number of speakers on the audio file, and other factors.

When you send a transcription request to Speech-to-Text, you can include these additional details about the audio data as recognition metadata. Speech-to-Text can use these details to transcribe your audio data more accurately.

Google also analyzes and aggregates the most common use cases for Speech-to-Text by collecting this metadata. Google can then prioritize the most prominent use cases for improvements to Cloud Speech-to-Text.

Available metadata fields

You can provide any of the fields in the following list in the metadata of a transcription request.

interactionType (InteractionType): The use case of the audio.
industryNaicsCodeOfAudio (number): The industry vertical of the audio file, as a 6-digit NAICS code.
microphoneDistance (MicrophoneDistance): The distance of the microphone from the speaker.
originalMediaType (OriginalMediaType): The original media of the audio, either audio or video.
recordingDeviceType (RecordingDeviceType): The kind of device used to capture the audio, such as a smartphone, PC microphone, or vehicle.
recordingDeviceName (string): The device used to make the recording. This arbitrary string can include names like 'Pixel XL', 'VoIP', 'Cardioid Microphone', or another value.
originalMimeType (string): The MIME type of the original audio file. Examples include audio/m4a, audio/x-alaw-basic, audio/mp3, audio/3gpp, or another audio file MIME type.
obfuscatedId (string): The privacy-protected ID of the user, used to identify the number of unique users of the service.
audioTopic (string): An arbitrary description of the subject matter discussed in the audio file. Examples include "Guided tour of New York City," "court trial hearing," or "live interview between 2 people."

See the RecognitionMetadata reference documentation for more information about these fields.
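As a quick illustration of where these fields live, the following minimal Python sketch (using the v1p1beta1 client library, as in the Python sample later on this page) populates a RecognitionMetadata object with each field in the list above. The values mirror the REST example below and describe a hypothetical recording; they are illustrative only, not requirements.

from google.cloud import speech_v1p1beta1 as speech

# Illustrative only: populate each RecognitionMetadata field listed above.
# Substitute values that match your own audio.
metadata = speech.types.RecognitionMetadata()
metadata.interaction_type = (
    speech.enums.RecognitionMetadata.InteractionType.VOICE_SEARCH)
metadata.industry_naics_code_of_audio = 23810  # 6-digit NAICS code
metadata.microphone_distance = (
    speech.enums.RecognitionMetadata.MicrophoneDistance.NEARFIELD)
metadata.original_media_type = (
    speech.enums.RecognitionMetadata.OriginalMediaType.AUDIO)
metadata.recording_device_type = (
    speech.enums.RecognitionMetadata.RecordingDeviceType.OTHER_INDOOR_DEVICE)
metadata.recording_device_name = 'Polycom SoundStation IP 6000'
metadata.original_mime_type = 'audio/mp3'
# obfuscated_id is a 64-bit integer in the client library; the REST API
# renders it as a quoted string.
metadata.obfuscated_id = 11235813
metadata.audio_topic = 'questions about landmarks in NYC'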

Use recognition metadata

To add recognition metadata to a speech recognition request to the Speech-to-Text API, set the metadata field of the speech recognition request to a RecognitionMetadata object. The Speech-to-Text API supports recognition metadata for all speech recognition methods: speech:recognize, speech:longrunningrecognize, and StreamingRecognizeRequest. See the RecognitionMetadata reference documentation for more information on the types of metadata that you can include with your request.

The following code samples demonstrate how to specify additional metadata fields in a transcription request.

Protocol

Refer to the speech:recognize API endpoint for complete details.

To perform synchronous speech recognition, make a POST request and provide the appropriate request body. The following shows an example of a POST request using curl. The example uses the access token for a service account set up for the project using the Google Cloud SDK. For instructions on installing the Cloud SDK, setting up a project with a service account, and obtaining an access token, see the Quickstart.

curl -s -H "Content-Type: application/json" \
    -H "Authorization: Bearer "$(gcloud auth print-access-token) \
    https://speech.googleapis.com/v1p1beta1/speech:recognize \
    --data '{
    "config": {
        "encoding": "FLAC",
        "sampleRateHertz": 16000,
        "languageCode": "en-US",
        "enableWordTimeOffsets":  false,
        "metadata": {
            "interactionType": "VOICE_SEARCH",
            "industryNaicsCodeOfAudio": 23810,
            "microphoneDistance": "NEARFIELD",
            "originalMediaType": "AUDIO",
            "recordingDeviceType": "OTHER_INDOOR_DEVICE",
            "recordingDeviceName": "Polycom SoundStation IP 6000",
            "originalMimeType": "audio/mp3",
            "obfuscatedId": "11235813",
            "audioTopic": "questions about landmarks in NYC"
        }
    },
    "audio": {
        "uri":"gs://cloud-samples-tests/speech/brooklyn.flac"
    }
}'

See the RecognitionConfig reference documentation for more information on configuring the request body.

If the request is successful, the server returns a 200 OK HTTP status code and the response in JSON format:

{
  "results": [
    {
      "alternatives": [
        {
          "transcript": "how old is the Brooklyn Bridge",
          "confidence": 0.98360395
        }
      ]
    }
  ]
}

Java

/**
 * Transcribe the given audio file and include recognition metadata in the request.
 *
 * @param fileName the path to an audio file.
 */
public static void transcribeFileWithMetadata(String fileName) throws Exception {
  Path path = Paths.get(fileName);
  byte[] content = Files.readAllBytes(path);

  try (SpeechClient speechClient = SpeechClient.create()) {
    // Get the contents of the local audio file
    RecognitionAudio recognitionAudio =
        RecognitionAudio.newBuilder().setContent(ByteString.copyFrom(content)).build();

    // Construct a recognition metadata object.
    // Most metadata fields are specified as enums that are nested in
    // the RecognitionMetadata class.
    RecognitionMetadata metadata =
        RecognitionMetadata.newBuilder()
            .setInteractionType(InteractionType.DISCUSSION)
            .setMicrophoneDistance(MicrophoneDistance.NEARFIELD)
            .setRecordingDeviceType(RecordingDeviceType.SMARTPHONE)
            .setRecordingDeviceName("Pixel 2 XL") // Some metadata fields are free form strings
            // And some are integers, for instance the 6 digit NAICS code
            // https://www.naics.com/search/
            .setIndustryNaicsCodeOfAudio(519190)
            .build();

    // Configure the request and attach the metadata to the recognition config
    RecognitionConfig config =
        RecognitionConfig.newBuilder()
            .setEncoding(AudioEncoding.LINEAR16)
            .setLanguageCode("en-US")
            .setSampleRateHertz(8000)
            .setMetadata(metadata) // Add the metadata to the config
            .build();

    // Perform the transcription request
    RecognizeResponse recognizeResponse = speechClient.recognize(config, recognitionAudio);

    // Print out the results
    for (SpeechRecognitionResult result : recognizeResponse.getResultsList()) {
      // There can be several alternative transcripts for a given chunk of speech. Just use the
      // first (most likely) one here.
      SpeechRecognitionAlternative alternative = result.getAlternatives(0);
      System.out.format("Transcript: %s\n\n", alternative.getTranscript());
    }
  }
}

Node.js

// Imports the Google Cloud client library for Beta API
/**
 * TODO(developer): Update client library import to use new
 * version of API when desired features become available
 */
const speech = require('@google-cloud/speech').v1p1beta1;
const fs = require('fs');

// Creates a client
const client = new speech.SpeechClient();

/**
 * TODO(developer): Uncomment the following lines before running the sample.
 */
// const filename = 'Local path to audio file, e.g. /path/to/audio.raw';
// const encoding = 'Encoding of the audio file, e.g. LINEAR16';
// const sampleRateHertz = 16000;
// const languageCode = 'BCP-47 language code, e.g. en-US';

const recognitionMetadata = {
  interactionType: 'DISCUSSION',
  microphoneDistance: 'NEARFIELD',
  recordingDeviceType: 'SMARTPHONE',
  recordingDeviceName: 'Pixel 2 XL',
  industryNaicsCodeOfAudio: 519190,
};

const config = {
  encoding: encoding,
  languageCode: languageCode,
  metadata: recognitionMetadata,
};

const audio = {
  content: fs.readFileSync(filename).toString('base64'),
};

const request = {
  config: config,
  audio: audio,
};

// Detects speech in the audio file
client
  .recognize(request)
  .then(data => {
    const response = data[0];
    response.results.forEach(result => {
      const alternative = result.alternatives[0];
      console.log(alternative.transcript);
    });
  })
  .catch(err => {
    console.error('ERROR:', err);
  });

Python

import io

from google.cloud import speech_v1p1beta1 as speech

client = speech.SpeechClient()

speech_file = 'resources/commercial_mono.wav'

with io.open(speech_file, 'rb') as audio_file:
    content = audio_file.read()

# Here we construct a recognition metadata object.
# Most metadata fields are specified as enums that can be found
# in speech.enums.RecognitionMetadata
metadata = speech.types.RecognitionMetadata()
metadata.interaction_type = (
    speech.enums.RecognitionMetadata.InteractionType.DISCUSSION)
metadata.microphone_distance = (
    speech.enums.RecognitionMetadata.MicrophoneDistance.NEARFIELD)
metadata.recording_device_type = (
    speech.enums.RecognitionMetadata.RecordingDeviceType.SMARTPHONE)
# Some metadata fields are free form strings
metadata.recording_device_name = "Pixel 2 XL"
# And some are integers, for instance the 6 digit NAICS code
# https://www.naics.com/search/
metadata.industry_naics_code_of_audio = 519190

audio = speech.types.RecognitionAudio(content=content)
config = speech.types.RecognitionConfig(
    encoding=speech.enums.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=8000,
    language_code='en-US',
    # Add this in the request to send metadata.
    metadata=metadata)

response = client.recognize(config, audio)

for i, result in enumerate(response.results):
    alternative = result.alternatives[0]
    print('-' * 20)
    print('First alternative of result {}'.format(i))
    print('Transcript: {}'.format(alternative.transcript))
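
The samples above use the synchronous recognize method. As noted earlier, recognition metadata is also accepted by the streaming method (StreamingRecognizeRequest). The following is a minimal sketch, not taken from the official samples, of how the same metadata could be attached to the RecognitionConfig nested inside a StreamingRecognitionConfig; it reuses the audio file from the Python sample above, and the single-chunk request list is an assumption made to keep the sketch short.

import io

from google.cloud import speech_v1p1beta1 as speech

client = speech.SpeechClient()

speech_file = 'resources/commercial_mono.wav'

metadata = speech.types.RecognitionMetadata()
metadata.interaction_type = (
    speech.enums.RecognitionMetadata.InteractionType.DISCUSSION)
metadata.microphone_distance = (
    speech.enums.RecognitionMetadata.MicrophoneDistance.NEARFIELD)
metadata.recording_device_type = (
    speech.enums.RecognitionMetadata.RecordingDeviceType.SMARTPHONE)

# Metadata is set on the RecognitionConfig, which is then wrapped in a
# StreamingRecognitionConfig for the streaming call.
config = speech.types.RecognitionConfig(
    encoding=speech.enums.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=8000,
    language_code='en-US',
    metadata=metadata)
streaming_config = speech.types.StreamingRecognitionConfig(config=config)

with io.open(speech_file, 'rb') as audio_file:
    content = audio_file.read()

# For brevity, send the whole file as a single request; a real client
# would yield successive chunks of audio_content.
requests = [speech.types.StreamingRecognizeRequest(audio_content=content)]

for response in client.streaming_recognize(streaming_config, requests):
    for result in response.results:
        print('Transcript: {}'.format(result.alternatives[0].transcript))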
