Separating different speakers in an audio recording

This page describes how to get labels for different speakers in audio data transcribed by Speech-to-Text.

Sometimes, audio data contains samples of more than one person talking. For example, audio from a telephone call usually features voices from two or more people. A transcription of the call ideally includes who speaks at which times.

Speaker diarization

Speech-to-Text can recognize multiple speakers in the same audio clip. When you send an audio transcription request to Speech-to-Text, you can include a parameter telling Speech-to-Text to identify the different speakers in the audio sample. This feature, called speaker diarization, detects when speakers change and labels by number the individual voices detected in the audio.

When you enable speaker diarization in your transcription request, Speech-to-Text attempts to distinguish the different voices included in the audio sample. The transcription result tags each word with a number assigned to individual speakers. Words spoken by the same speaker bear the same number. A transcription result can include numbers up to as many speakers as Speech-to-Text can uniquely identify in the audio sample.

When you use speaker diarization, Speech-to-Text produces a running aggregate of all the results provided in the transcription. Each result includes the words from the previous result. Thus, the words array in the final result provides the complete, diarized results of the transcription.

Review the language support page to see if this feature is available for your language.

Enabling speaker diarization in a request

To enable speaker diarization, you need to set the enableSpeakerDiarization field to true in the RecognitionConfig parameters for the request. To improve your transcription results, you should also specify the number of speakers present in the audio clip by setting the diarizationSpeakerCount field in the RecognitionConfig parameters. Speech-to-Text uses a default value if you do not provide a value for diarizationSpeakerCount.

Speech-to-Text supports speaker diarization for all speech recognition methods: speech:recognize speech:longrunningrecognize, and Streaming.

Using a local file

The following code snippet demonstrates how to enable speaker diarization in a transcription request to Speech-to-Text using a local file

Protocol

Refer to the speech:recognize API endpoint for complete details.

To perform synchronous speech recognition, make a POST request and provide the appropriate request body. The following shows an example of a POST request using curl. The example uses the access token for a service account set up for the project using the Google Cloud Cloud SDK. For instructions on installing the Cloud SDK, setting up a project with a service account, and obtaining an access token, see the quickstart.

curl -s -H "Content-Type: application/json" \
    -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
    https://speech.googleapis.com/v1p1beta1/speech:recognize \
    --data '{
    "config": {
        "encoding": "LINEAR16",
        "languageCode": "en-US",
        "enableSpeakerDiarization": true,
        "diarizationSpeakerCount": 2,
        "model": "phone_call"
    },
    "audio": {
        "uri": "gs://cloud-samples-tests/speech/commercial_mono.wav"
    }
}' > speaker-diarization.txt

If the request is successful, the server returns a 200 OK HTTP status code and the response in JSON format, saved to a file named speaker-diarization.txt.

{
  "results": [
    {
      "alternatives": [
        {
          "transcript": "hi I'd like to buy a Chromecast and I was wondering whether you could help me with that certainly which color would you like we have blue black and red uh let's go with the black one would you like the new Chromecast Ultra model or the regular Chrome Cast regular Chromecast is fine thank you okay sure we like to ship it regular or Express Express please terrific it's on the way thank you thank you very much bye",
          "confidence": 0.92142606,
          "words": [
            {
              "startTime": "0s",
              "endTime": "1.100s",
              "word": "hi",
              "speakerTag": 2
            },
            {
              "startTime": "1.100s",
              "endTime": "2s",
              "word": "I'd",
              "speakerTag": 2
            },
            {
              "startTime": "2s",
              "endTime": "2s",
              "word": "like",
              "speakerTag": 2
            },
            {
              "startTime": "2s",
              "endTime": "2.100s",
              "word": "to",
              "speakerTag": 2
            },
            ...
            {
              "startTime": "6.500s",
              "endTime": "6.900s",
              "word": "certainly",
              "speakerTag": 1
            },
            {
              "startTime": "6.900s",
              "endTime": "7.300s",
              "word": "which",
              "speakerTag": 1
            },
            {
              "startTime": "7.300s",
              "endTime": "7.500s",
              "word": "color",
              "speakerTag": 1
            },
            ...
          ]
        }
      ],
      "languageCode": "en-us"
    }
  ]
}

Java

/**
 * Transcribe the given audio file using speaker diarization.
 *
 * @param fileName the path to an audio file.
 */
public static void transcribeDiarization(String fileName) throws Exception {
  Path path = Paths.get(fileName);
  byte[] content = Files.readAllBytes(path);

  try (SpeechClient speechClient = SpeechClient.create()) {
    // Get the contents of the local audio file
    RecognitionAudio recognitionAudio =
        RecognitionAudio.newBuilder().setContent(ByteString.copyFrom(content)).build();

    SpeakerDiarizationConfig speakerDiarizationConfig =
        SpeakerDiarizationConfig.newBuilder()
            .setEnableSpeakerDiarization(true)
            .setMinSpeakerCount(2)
            .setMaxSpeakerCount(2)
            .build();

    // Configure request to enable Speaker diarization
    RecognitionConfig config =
        RecognitionConfig.newBuilder()
            .setEncoding(AudioEncoding.LINEAR16)
            .setLanguageCode("en-US")
            .setSampleRateHertz(8000)
            .setDiarizationConfig(speakerDiarizationConfig)
            .build();

    // Perform the transcription request
    RecognizeResponse recognizeResponse = speechClient.recognize(config, recognitionAudio);

    // Speaker Tags are only included in the last result object, which has only one alternative.
    SpeechRecognitionAlternative alternative =
        recognizeResponse.getResults(recognizeResponse.getResultsCount() - 1).getAlternatives(0);

    // The alternative is made up of WordInfo objects that contain the speaker_tag.
    WordInfo wordInfo = alternative.getWords(0);
    int currentSpeakerTag = wordInfo.getSpeakerTag();

    // For each word, get all the words associated with one speaker, once the speaker changes,
    // add a new line with the new speaker and their spoken words.
    StringBuilder speakerWords =
        new StringBuilder(
            String.format("Speaker %d: %s", wordInfo.getSpeakerTag(), wordInfo.getWord()));

    for (int i = 1; i < alternative.getWordsCount(); i++) {
      wordInfo = alternative.getWords(i);
      if (currentSpeakerTag == wordInfo.getSpeakerTag()) {
        speakerWords.append(" ");
        speakerWords.append(wordInfo.getWord());
      } else {
        speakerWords.append(
            String.format("\nSpeaker %d: %s", wordInfo.getSpeakerTag(), wordInfo.getWord()));
        currentSpeakerTag = wordInfo.getSpeakerTag();
      }
    }

    System.out.println(speakerWords.toString());
  }
}

Node.js

const fs = require('fs');

// Imports the Google Cloud client library
const speech = require('@google-cloud/speech').v1p1beta1;

// Creates a client
const client = new speech.SpeechClient();

/**
 * TODO(developer): Uncomment the following lines before running the sample.
 */
// const fileName = 'Local path to audio file, e.g. /path/to/audio.raw';

const config = {
  encoding: 'LINEAR16',
  sampleRateHertz: 8000,
  languageCode: 'en-US',
  enableSpeakerDiarization: true,
  diarizationSpeakerCount: 2,
  model: 'phone_call',
};

const audio = {
  content: fs.readFileSync(fileName).toString('base64'),
};

const request = {
  config: config,
  audio: audio,
};

const [response] = await client.recognize(request);
const transcription = response.results
  .map(result => result.alternatives[0].transcript)
  .join('\n');
console.log(`Transcription: ${transcription}`);
console.log('Speaker Diarization:');
const result = response.results[response.results.length - 1];
const wordsInfo = result.alternatives[0].words;
// Note: The transcript within each result is separate and sequential per result.
// However, the words list within an alternative includes all the words
// from all the results thus far. Thus, to get all the words with speaker
// tags, you only have to take the words list from the last result:
wordsInfo.forEach(a =>
  console.log(` word: ${a.word}, speakerTag: ${a.speakerTag}`)
);

Python

from google.cloud import speech_v1p1beta1 as speech
client = speech.SpeechClient()

speech_file = 'resources/commercial_mono.wav'

with open(speech_file, 'rb') as audio_file:
    content = audio_file.read()

audio = speech.RecognitionAudio(content=content)

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=8000,
    language_code='en-US',
    enable_speaker_diarization=True,
    diarization_speaker_count=2)

print('Waiting for operation to complete...')
response = client.recognize(config=config, audio=audio)

# The transcript within each result is separate and sequential per result.
# However, the words list within an alternative includes all the words
# from all the results thus far. Thus, to get all the words with speaker
# tags, you only have to take the words list from the last result:
result = response.results[-1]

words_info = result.alternatives[0].words

# Printing out the output:
for word_info in words_info:
    print(u"word: '{}', speaker_tag: {}".format(
        word_info.word, word_info.speaker_tag))

Using a Google Cloud Storage bucket

The following code snippet demonstrates how to enable speaker diarization in a transcription request to Speech-to-Text using a Google Cloud Storage file

Java

/**
 * Transcribe a remote audio file using speaker diarization.
 *
 * @param gcsUri the path to an audio file.
 */
public static void transcribeDiarizationGcs(String gcsUri) throws Exception {
  try (SpeechClient speechClient = SpeechClient.create()) {
    SpeakerDiarizationConfig speakerDiarizationConfig =
        SpeakerDiarizationConfig.newBuilder()
            .setEnableSpeakerDiarization(true)
            .setMinSpeakerCount(2)
            .setMaxSpeakerCount(2)
            .build();

    // Configure request to enable Speaker diarization
    RecognitionConfig config =
        RecognitionConfig.newBuilder()
            .setEncoding(AudioEncoding.LINEAR16)
            .setLanguageCode("en-US")
            .setSampleRateHertz(8000)
            .setDiarizationConfig(speakerDiarizationConfig)
            .build();

    // Set the remote path for the audio file
    RecognitionAudio audio = RecognitionAudio.newBuilder().setUri(gcsUri).build();

    // Use non-blocking call for getting file transcription
    OperationFuture<LongRunningRecognizeResponse, LongRunningRecognizeMetadata> response =
        speechClient.longRunningRecognizeAsync(config, audio);

    while (!response.isDone()) {
      System.out.println("Waiting for response...");
      Thread.sleep(10000);
    }

    // Speaker Tags are only included in the last result object, which has only one alternative.
    LongRunningRecognizeResponse longRunningRecognizeResponse = response.get();
    SpeechRecognitionAlternative alternative =
        longRunningRecognizeResponse
            .getResults(longRunningRecognizeResponse.getResultsCount() - 1)
            .getAlternatives(0);

    // The alternative is made up of WordInfo objects that contain the speaker_tag.
    WordInfo wordInfo = alternative.getWords(0);
    int currentSpeakerTag = wordInfo.getSpeakerTag();

    // For each word, get all the words associated with one speaker, once the speaker changes,
    // add a new line with the new speaker and their spoken words.
    StringBuilder speakerWords =
        new StringBuilder(
            String.format("Speaker %d: %s", wordInfo.getSpeakerTag(), wordInfo.getWord()));

    for (int i = 1; i < alternative.getWordsCount(); i++) {
      wordInfo = alternative.getWords(i);
      if (currentSpeakerTag == wordInfo.getSpeakerTag()) {
        speakerWords.append(" ");
        speakerWords.append(wordInfo.getWord());
      } else {
        speakerWords.append(
            String.format("\nSpeaker %d: %s", wordInfo.getSpeakerTag(), wordInfo.getWord()));
        currentSpeakerTag = wordInfo.getSpeakerTag();
      }
    }

    System.out.println(speakerWords.toString());
  }
}

Node.js

// Imports the Google Cloud client library
const speech = require('@google-cloud/speech').v1p1beta1;

// Creates a client
const client = new speech.SpeechClient();

/**
 * TODO(developer): Uncomment the following line before running the sample.
 */
// const uri = path to GCS audio file e.g. `gs:/bucket/audio.wav`;

const config = {
  encoding: 'LINEAR16',
  sampleRateHertz: 8000,
  languageCode: 'en-US',
  enableSpeakerDiarization: true,
  diarizationSpeakerCount: 2,
  model: 'phone_call',
};

const audio = {
  uri: gcsUri,
};

const request = {
  config: config,
  audio: audio,
};

const [response] = await client.recognize(request);
const transcription = response.results
  .map(result => result.alternatives[0].transcript)
  .join('\n');
console.log(`Transcription: ${transcription}`);
console.log('Speaker Diarization:');
const result = response.results[response.results.length - 1];
const wordsInfo = result.alternatives[0].words;
// Note: The transcript within each result is separate and sequential per result.
// However, the words list within an alternative includes all the words
// from all the results thus far. Thus, to get all the words with speaker
// tags, you only have to take the words list from the last result:
wordsInfo.forEach(a =>
  console.log(` word: ${a.word}, speakerTag: ${a.speakerTag}`)
);