This page describes how to get labels for different speakers in audio data transcribed by Speech-to-Text.
Sometimes, audio data contains samples of more than one person talking. For example, audio from a telephone call usually features voices from two or more people. A transcription of the call ideally includes who speaks at which times.
Speaker diarization
Speech-to-Text can recognize multiple speakers in the same audio clip. When you send an audio transcription request to Speech-to-Text, you can include a parameter telling Speech-to-Text to identify the different speakers in the audio sample. This feature, called speaker diarization, detects when speakers change and labels by number the individual voices detected in the audio.
When you enable speaker diarization in your transcription request, Speech-to-Text attempts to distinguish the different voices included in the audio sample. The transcription result tags each word with a number assigned to individual speakers. Words spoken by the same speaker bear the same number. A transcription result can include numbers up to as many speakers as Speech-to-Text can uniquely identify in the audio sample.
When you use speaker diarization, Speech-to-Text produces a running aggregate of all the results provided in the transcription. Each result includes the words from the previous result. Thus, the words array in the final result provides the complete, diarized results of the transcription.
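As a rough illustration of reading that final words array, the following Python sketch groups consecutive words by speaker label. It assumes the response has already been parsed from JSON into a dict with the field names used in the sample response on this page (results, alternatives, words, word, speakerLabel). The function name diarized_turns and the sample data are hypothetical, not part of any client library.

```python
from itertools import groupby


def diarized_turns(response: dict) -> list:
    """Return (speaker_label, text) pairs for consecutive runs of words.

    Hypothetical helper: assumes the JSON field names shown in the sample
    response on this page.
    """
    results = response.get("results", [])
    if not results:
        return []
    # Results aggregate, so only the final result's words array is read.
    alternatives = results[-1].get("alternatives") or []
    if not alternatives:
        return []
    words = alternatives[0].get("words", [])
    turns = []
    # groupby merges adjacent words that share the same speaker label.
    for label, run in groupby(words, key=lambda w: w.get("speakerLabel", "")):
        turns.append((label, " ".join(w["word"] for w in run)))
    return turns


# Illustrative data only, shaped like the sample response.
sample = {
    "results": [
        {"alternatives": [{"words": [
            {"word": "hi", "speakerLabel": "2"},
            {"word": "I'd", "speakerLabel": "2"},
            {"word": "certainly", "speakerLabel": "1"},
        ]}]}
    ]
}

for label, text in diarized_turns(sample):
    print(f"Speaker {label}: {text}")
```

Grouping adjacent words rather than all words per speaker keeps the conversation order, which is usually what a call transcript needs.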
Review the language support page to see if this feature is available for your language.
Enable speaker diarization in a request
To enable speaker diarization, set the diarization_config field in RecognitionFeatures. You must set the min_speaker_count and max_speaker_count values according to how many speakers you expect in the transcript.
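For the REST form of the request, a minimal Python sketch of building the JSON body might look like the following. The camelCase field names (diarizationConfig, minSpeakerCount, maxSpeakerCount, uri) are taken from the curl sample on this page. The helper name build_recognize_body and its range check are assumptions added for illustration; the service may apply different or additional validation.

```python
import json


def build_recognize_body(uri: str, min_speakers: int, max_speakers: int) -> str:
    """Build a JSON request body that enables speaker diarization.

    Hypothetical helper. The range check below is a client-side sanity
    check assumed for this sketch, not documented service behavior.
    """
    if min_speakers < 1 or max_speakers < min_speakers:
        raise ValueError("expected 1 <= min_speakers <= max_speakers")
    body = {
        "config": {
            "features": {
                "diarizationConfig": {
                    "minSpeakerCount": min_speakers,
                    "maxSpeakerCount": max_speakers,
                }
            }
        },
        "uri": uri,
    }
    return json.dumps(body, indent=2)


# Two expected speakers, as in the curl sample on this page.
print(build_recognize_body(
    "gs://cloud-samples-tests/speech/commercial_mono.wav", 2, 2))
```

Serializing with json.dumps guarantees well-formed JSON, which avoids hand-editing mistakes such as trailing commas.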
Speech-to-Text supports speaker diarization for all speech recognition methods: speech:recognize and Streaming.
Use a local file
The following code snippet demonstrates how to enable speaker diarization in a transcription request to Speech-to-Text using a local file.
Protocol
Refer to the speech:recognize API endpoint for complete details.
To perform synchronous speech recognition, make a POST request and provide the appropriate request body. The following shows an example of a POST request using curl. The example uses the Google Cloud CLI to generate an access token. For instructions on installing the gcloud CLI, see the quickstart.
curl -s -H "Content-Type: application/json" \
    -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
    https://speech.googleapis.com/v2/projects/{project}/locations/{location}/recognizers/{recognizer}:recognize \
    --data '{
  "config": {
    "features": {
      "diarizationConfig": {
        "minSpeakerCount": 2,
        "maxSpeakerCount": 2
      }
    }
  },
  "uri": "gs://cloud-samples-tests/speech/commercial_mono.wav"
}' > speaker-diarization.txt
If the request is successful, the server returns a 200 OK HTTP status code and the response in JSON format, saved to a file named speaker-diarization.txt.
{
  "results": [
    {
      "alternatives": [
        {
          "transcript": "hi I'd like to buy a Chromecast and I was wondering whether you could help me with that certainly which color would you like we have blue black and red uh let's go with the black one would you like the new Chromecast Ultra model or the regular Chrome Cast regular Chromecast is fine thank you okay sure we like to ship it regular or Express Express please terrific it's on the way thank you thank you very much bye",
          "confidence": 0.92142606,
          "words": [
            {
              "startOffset": "0s",
              "endOffset": "1.100s",
              "word": "hi",
              "speakerLabel": "2"
            },
            {
              "startOffset": "1.100s",
              "endOffset": "2s",
              "word": "I'd",
              "speakerLabel": "2"
            },
            {
              "startOffset": "2s",
              "endOffset": "2s",
              "word": "like",
              "speakerLabel": "2"
            },
            {
              "startOffset": "2s",
              "endOffset": "2.100s",
              "word": "to",
              "speakerLabel": "2"
            },
            ...
            {
              "startOffset": "6.500s",
              "endOffset": "6.900s",
              "word": "certainly",
              "speakerLabel": "1"
            },
            {
              "startOffset": "6.900s",
              "endOffset": "7.300s",
              "word": "which",
              "speakerLabel": "1"
            },
            {
              "startOffset": "7.300s",
              "endOffset": "7.500s",
              "word": "color",
              "speakerLabel": "1"
            },
            ...
          ]
        }
      ],
      "languageCode": "en-us"
    }
  ]
}
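The word-level offsets in a response like this one can be turned into per-speaker talk time. The following Python sketch assumes offsets are strings of seconds with a trailing "s" (as in "1.100s") and that each word carries startOffset, endOffset, and speakerLabel. The helper names are hypothetical, and defaulting a missing offset to "0s" is a defensive assumption made for this sketch, not documented behavior.

```python
from collections import defaultdict


def offset_seconds(offset: str) -> float:
    """Convert an offset string such as "1.100s" to seconds."""
    return float(offset.rstrip("s"))


def talk_time_by_speaker(words: list) -> dict:
    """Sum word durations, in seconds, for each speaker label."""
    totals = defaultdict(float)
    for w in words:
        # Assumption: treat a missing offset as "0s" instead of failing.
        start = offset_seconds(w.get("startOffset", "0s"))
        end = offset_seconds(w.get("endOffset", "0s"))
        totals[w["speakerLabel"]] += end - start
    return dict(totals)


# A few words copied from the sample response.
words = [
    {"startOffset": "0s", "endOffset": "1.100s", "word": "hi", "speakerLabel": "2"},
    {"startOffset": "1.100s", "endOffset": "2s", "word": "I'd", "speakerLabel": "2"},
    {"startOffset": "6.500s", "endOffset": "6.900s", "word": "certainly", "speakerLabel": "1"},
]

for label, seconds in sorted(talk_time_by_speaker(words).items()):
    print(f"Speaker {label}: {seconds:.1f}s")
```

This counts only time covered by recognized words, so pauses between words are excluded from each speaker's total.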
Go
To learn how to install and use the client library for Speech-to-Text, see Speech-to-Text client libraries. For more information, see the Speech-to-Text Go API reference documentation.
To authenticate to Speech-to-Text, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
Python
To learn how to install and use the client library for Speech-to-Text, see Speech-to-Text client libraries. For more information, see the Speech-to-Text Python API reference documentation.
To authenticate to Speech-to-Text, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.