
CCAI Transcription

CCAI Transcription lets you convert streaming audio data to transcribed text in real time. Agent Assist makes suggestions based on text, so audio data must be converted before it can be used. You can also use transcribed streaming audio with CCAI Insights to gather real-time data about agent conversations (for example, Topic Modeling).

There are two ways to transcribe streaming audio for use with CCAI: by using the SIPREC feature, or by making gRPC calls with audio data as the payload. This page describes the process of transcribing streaming audio data using gRPC calls.

CCAI Transcription works using Speech-to-Text streaming speech recognition. Speech-to-Text offers multiple recognition models, standard and enhanced. CCAI Transcription is supported at the GA level only when it is used with the enhanced phone call model.

Prerequisites

Create a project in Google Cloud.
Enable the Dialogflow API.
Contact your Google representative to make sure that your account has access to Speech-to-Text enhanced models.

Create a conversation profile

To create a conversation profile, use the Agent Assist console or call the create method on the ConversationProfile resource directly.

For CCAI Transcription, we recommend that you configure ConversationProfile.stt_config as the default InputAudioConfig when sending audio data in a conversation.
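
As a rough illustration of that recommendation, the sketch below builds a JSON body for a conversation profile whose speech settings default to the telephony model. It is not taken from this page: the helper name conversation_profile_body is hypothetical, and the field names (displayName, languageCode, sttConfig, model, speechModelVariant) are assumptions based on the Dialogflow v2 REST representation of ConversationProfile, so check them against the API reference before relying on them.

```python
import json
from typing import Dict


def conversation_profile_body(
    display_name: str, language_code: str = "en-US"
) -> Dict[str, object]:
    """Build a request body for creating a conversation profile.

    Field names are assumed from the Dialogflow v2 REST reference.
    """
    return {
        "displayName": display_name,
        "languageCode": language_code,
        # Default speech settings used when a stream does not override them.
        "sttConfig": {
            "model": "telephony",
            "speechModelVariant": "USE_BEST_AVAILABLE",
        },
    }


body = conversation_profile_body("transcription-profile")
print(json.dumps(body, indent=2))
```

Keeping the speech settings on the profile means each stream only has to send the fields it wants to override.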

Get transcriptions at conversation runtime

To get transcriptions during the conversation, you need to create participants for the conversation and send audio data for each participant.

Create participants

There are three types of participant. See the reference documentation for more details about their roles. Call the create method on the participant and specify the role. Only an END_USER or HUMAN_AGENT participant can call StreamingAnalyzeContent, which is required to get a transcription.
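
The role rule above can be expressed as a small check. This is a sketch under assumptions rather than library code: the helper names are hypothetical, the third role name (AUTOMATED_AGENT) comes from the Dialogflow Participant.Role enum rather than from this page, and the body shape follows the REST representation of Participant.

```python
from typing import Dict

# Roles that may call StreamingAnalyzeContent to get a transcription.
TRANSCRIPTION_ROLES = {"END_USER", "HUMAN_AGENT"}
# AUTOMATED_AGENT is assumed from the Participant.Role enum.
ALL_ROLES = TRANSCRIPTION_ROLES | {"AUTOMATED_AGENT"}


def participant_body(role: str) -> Dict[str, str]:
    """Build a request body for creating a participant with the given role."""
    if role not in ALL_ROLES:
        raise ValueError(f"Unknown participant role: {role!r}")
    return {"role": role}


def can_stream_for_transcription(role: str) -> bool:
    """Return True if a participant with this role can receive a transcription."""
    return role in TRANSCRIPTION_ROLES


print(participant_body("END_USER"))
print(can_stream_for_transcription("AUTOMATED_AGENT"))
```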

Send audio data and get a transcript

You can use StreamingAnalyzeContent to send a participant's audio to Google and get the transcription, with the following parameters:

The first request in the stream must be InputAudioConfig. The fields configured here override the corresponding settings in ConversationProfile.stt_config. Don't send any audio input until the second request.
- audioEncoding must be set to AUDIO_ENCODING_LINEAR_16 or AUDIO_ENCODING_MULAW.
- model: the Speech-to-Text model that you want to use to transcribe your audio. Set this field to telephony. The variant doesn't affect transcription quality, so you can leave the speech model variant unspecified or choose Use best available.
- singleUtterance should be set to false for the best transcription quality. Don't expect END_OF_SINGLE_UTTERANCE if singleUtterance is false, but you can rely on isFinal==true inside StreamingAnalyzeContentResponse.recognition_result to half-close the stream.
- Optional additional parameters: the following parameters are optional. To get access to these parameters, contact your Google representative.
  - languageCode: the language_code of the audio. The default value is en-US.
  - alternativeLanguageCodes: additional languages that might be detected in the audio. Agent Assist uses the language_code field to automatically detect the language at the start of the audio and keeps that language for all subsequent conversation turns. The alternativeLanguageCodes field lets you specify more options for Agent Assist to choose from.
  - phraseSets: the resource name of the Speech-to-Text model adaptation phraseSet. To use model adaptation with CCAI Transcription, first create the phraseSet using the Speech-to-Text API, then specify its resource name here.
After you send the second request with the audio payload, you start receiving StreamingAnalyzeContentResponses from the stream.
- You can half-close the stream (or stop sending, in some languages such as Python) when is_final is set to true in StreamingAnalyzeContentResponse.recognition_result.
- After you half-close the stream, the server sends back the response containing the final transcript, along with potential Dialogflow or Agent Assist suggestions.
You can find the final transcription in the following locations:
- StreamingAnalyzeContentResponse.message.content
- If you enable Pub/Sub notifications, you can also see the transcription in Pub/Sub.
Start a new stream after the previous stream is closed.
- Audio re-send: audio data generated after the last speech_end_offset of the response with is_final=true, up to the start time of the new stream, needs to be re-sent to StreamingAnalyzeContent for the best transcription quality.
This diagram illustrates how the stream works.
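
The request ordering described above can be sketched without the client library. This is a minimal illustration, not the library's API: plain dicts stand in for StreamingAnalyzeContentRequest messages, the key names (participant, audio_config, input_audio) are assumed to mirror that message's fields, and build_requests, state, and the resource name are hypothetical.

```python
from typing import Dict, Iterable, Iterator


def build_requests(
    participant: str,
    audio_config: Dict[str, object],
    chunks: Iterable[bytes],
    state: Dict[str, bool],
) -> Iterator[Dict[str, object]]:
    """Yield a config-only first request, then audio-only requests."""
    # First request: participant and audio configuration, no audio payload.
    yield {"participant": participant, "audio_config": audio_config}
    for chunk in chunks:
        # Once a final recognition result has been seen, stop yielding.
        # In Python, ending the request iterator half-closes the stream.
        if state["is_final"]:
            return
        yield {"input_audio": chunk}


config = {
    "audio_encoding": "AUDIO_ENCODING_LINEAR_16",
    "sample_rate_hertz": 16000,
    "language_code": "en-US",
    "model": "telephony",
    "single_utterance": False,
}
state = {"is_final": False}
requests = build_requests(
    "projects/my-project/conversations/my-conversation/participants/my-participant",
    config,
    [b"\x00\x00" * 1600] * 5,  # five 100 ms chunks of 16-bit silence
    state,
)

first = next(requests)      # configuration only
second = next(requests)     # first audio chunk
state["is_final"] = True    # pretend a response with is_final arrived
remaining = list(requests)  # the generator ends instead of sending more audio
print("input_audio" in first, "input_audio" in second, len(remaining))
# prints: False True 0
```

Sharing a small mutable state object between the response loop and the request generator is one simple way to let a final result stop the outgoing audio; the full sample on this page uses an attribute on the microphone stream for the same purpose.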

Streaming recognition request code sample

The following code sample illustrates how to send a streaming transcription request:

Python

To authenticate to Agent Assist, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

# Copyright 2023 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""Google Cloud Dialogflow API sample code using the StreamingAnalyzeContent
API.

Also please contact Google to get credentials of this project and set up the
credential file json locations by running:
export GOOGLE_APPLICATION_CREDENTIALS=<cred_json_file_location>

Example usage:
    export GOOGLE_CLOUD_PROJECT='cloud-contact-center-ext-demo'
    export CONVERSATION_PROFILE='FnuBYO8eTBWM8ep1i-eOng'
    export GOOGLE_APPLICATION_CREDENTIALS='/Users/ruogu/Desktop/keys/cloud-contact-center-ext-demo-78798f9f9254.json'
    python streaming_transcription.py

Then start talking in English; you should see the transcription show up as you speak.

Say "Quit" or "Exit" to stop.
"""

import os
import queue
import re
import sys

from google.api_core.exceptions import DeadlineExceeded

import pyaudio

import conversation_management
import participant_management

PROJECT_ID = os.getenv("GOOGLE_CLOUD_PROJECT")
CONVERSATION_PROFILE_ID = os.getenv("CONVERSATION_PROFILE")

# Audio recording parameters
SAMPLE_RATE = 16000
CHUNK_SIZE = int(SAMPLE_RATE / 10)  # 100ms
RESTART_TIMEOUT = 160  # seconds
MAX_LOOKBACK = 3  # seconds

YELLOW = "\033[0;33m"


class ResumableMicrophoneStream:
    """Opens a recording stream as a generator yielding the audio chunks."""

    def __init__(self, rate, chunk_size):
        self._rate = rate
        self.chunk_size = chunk_size
        self._num_channels = 1
        self._buff = queue.Queue()
        self.is_final = False
        self.closed = True
        # Count the number of times the stream analyze content restarts.
        self.restart_counter = 0
        self.last_start_time = 0
        # Time end of the last is_final in millisec since last_start_time.
        self.is_final_offset = 0
        # Save the audio chunks generated from the start of the audio stream for
        # replay after restart.
        self.audio_input_chunks = []
        self.new_stream = True
        self._audio_interface = pyaudio.PyAudio()
        self._audio_stream = self._audio_interface.open(
            format=pyaudio.paInt16,
            channels=self._num_channels,
            rate=self._rate,
            input=True,
            frames_per_buffer=self.chunk_size,
            # Run the audio stream asynchronously to fill the buffer object.
            # This is necessary so that the input device's buffer doesn't
            # overflow while the calling thread makes network requests, etc.
            stream_callback=self._fill_buffer,
        )

    def __enter__(self):
        self.closed = False
        return self

    def __exit__(self, type, value, traceback):
        self._audio_stream.stop_stream()
        self._audio_stream.close()
        self.closed = True
        # Signal the generator to terminate so that the client's
        # streaming_recognize method will not block the process termination.
        self._buff.put(None)
        self._audio_interface.terminate()

    def _fill_buffer(self, in_data, *args, **kwargs):
        """Continuously collect data from the audio stream, into the buffer in
        chunksize."""

        self._buff.put(in_data)
        return None, pyaudio.paContinue

    def generator(self):
        """Stream Audio from microphone to API and to local buffer"""
        try:
            # Handle restart.
            print("restart generator")
            # Flip the bit of is_final so it can continue stream.
            self.is_final = False
            total_processed_time = self.last_start_time + self.is_final_offset
            # Bytes already transcribed: ms * samples/s * 2 bytes per sample / 1000.
            processed_bytes_length = int(
                total_processed_time * SAMPLE_RATE * 16 / 8 / 1000
            )
            self.last_start_time = total_processed_time
            # Send out bytes stored in self.audio_input_chunks that is after the
            # processed_bytes_length.
            if processed_bytes_length != 0:
                audio_bytes = b"".join(self.audio_input_chunks)
                # Lookback for unprocessed audio data.
                need_to_process_length = min(
                    int(len(audio_bytes) - processed_bytes_length),
                    int(MAX_LOOKBACK * SAMPLE_RATE * 16 / 8),
                )
                # Only replay when there is unprocessed audio: a zero length
                # would make `audio_bytes[-0:]` return the whole buffer.
                if need_to_process_length > 0:
                    yield audio_bytes[-need_to_process_length:]

            while not self.closed and not self.is_final:
                data = []
                # Use a blocking get() to ensure there's at least one chunk of
                # data, and stop iteration if the chunk is None, indicating the
                # end of the audio stream.
                chunk = self._buff.get()

                if chunk is None:
                    return
                data.append(chunk)
                # Now try to get the rest of the chunks, if any are left in _buff.
                while True:
                    try:
                        chunk = self._buff.get(block=False)

                        if chunk is None:
                            return
                        data.append(chunk)

                    except queue.Empty:
                        break
                self.audio_input_chunks.extend(data)
                if data:
                    yield b"".join(data)
        finally:
            print("Stop generator")


def main():
    """start bidirectional streaming from microphone input to Dialogflow API"""
    # Create conversation.
    conversation = conversation_management.create_conversation(
        project_id=PROJECT_ID, conversation_profile_id=CONVERSATION_PROFILE_ID
    )

    conversation_id = conversation.name.split("conversations/")[1].rstrip()

    # Create end user participant.
    end_user = participant_management.create_participant(
        project_id=PROJECT_ID, conversation_id=conversation_id, role="END_USER"
    )
    participant_id = end_user.name.split("participants/")[1].rstrip()

    mic_manager = ResumableMicrophoneStream(SAMPLE_RATE, CHUNK_SIZE)
    print(mic_manager.chunk_size)
    sys.stdout.write(YELLOW)
    sys.stdout.write('\nListening, say "Quit" or "Exit" to stop.\n\n')
    sys.stdout.write("End (ms)       Transcript Results/Status\n")
    sys.stdout.write("=====================================================\n")

    with mic_manager as stream:
        while not stream.closed:
            terminate = False
            while not terminate:
                try:
                    print(f"New Streaming Analyze Request: {stream.restart_counter}")
                    stream.restart_counter += 1
                    # Send request to streaming and get response.
                    responses = participant_management.analyze_content_audio_stream(
                        conversation_id=conversation_id,
                        participant_id=participant_id,
                        sample_rate_herz=SAMPLE_RATE,
                        stream=stream,
                        timeout=RESTART_TIMEOUT,
                        language_code="en-US",
                        single_utterance=False,
                    )

                    # Now, print the final transcription responses to user.
                    for response in responses:
                        if response.message:
                            print(response)
                        if response.recognition_result.is_final:
                            print(response)
                            # offset return from recognition_result is relative
                            # to the beginning of audio stream.
                            offset = response.recognition_result.speech_end_offset
                            stream.is_final_offset = int(
                                offset.seconds * 1000 + offset.microseconds / 1000
                            )
                            transcript = response.recognition_result.transcript
                            # Half-close the stream with gRPC (in Python just stop yielding requests)
                            stream.is_final = True
                            # Exit recognition if any of the transcribed phrase could be
                            # one of our keywords.
                            if re.search(r"\b(exit|quit)\b", transcript, re.I):
                                sys.stdout.write(YELLOW)
                                sys.stdout.write("Exiting...\n")
                                terminate = True
                                stream.closed = True
                                break
                except DeadlineExceeded:
                    print("Deadline Exceeded, restarting.")

            if terminate:
                conversation_management.complete_conversation(
                    project_id=PROJECT_ID, conversation_id=conversation_id
                )
                break


if __name__ == "__main__":
    main()
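
The re-send arithmetic used when the generator restarts can be checked in isolation. The sketch below is a simplified restatement of that logic, not part of the sample: bytes_to_resend and tail_to_resend are hypothetical helper names, the constants copy the sample's values, and clamping a negative length to zero is an added assumption.

```python
SAMPLE_RATE = 16000   # samples per second, as in the sample above
BYTES_PER_SAMPLE = 2  # 16-bit linear PCM
MAX_LOOKBACK = 3      # seconds, as in the sample above


def bytes_to_resend(buffered_bytes: int, processed_ms: int) -> int:
    """Return how many trailing bytes of the buffer to replay on restart."""
    # Audio already covered by final results, converted from ms to bytes.
    processed_bytes = processed_ms * SAMPLE_RATE * BYTES_PER_SAMPLE // 1000
    # Assumption: never return a negative length.
    unprocessed = max(buffered_bytes - processed_bytes, 0)
    # Never replay more than MAX_LOOKBACK seconds of audio.
    lookback_cap = MAX_LOOKBACK * SAMPLE_RATE * BYTES_PER_SAMPLE
    return min(unprocessed, lookback_cap)


def tail_to_resend(audio: bytes, length: int) -> bytes:
    """Return the last `length` bytes, or nothing when length is zero."""
    return audio[len(audio) - length:] if length > 0 else b""


# 10 s buffered, final results cover the first 8.5 s: replay 1.5 s.
print(bytes_to_resend(320000, 8500))  # prints: 48000
# 10 s buffered, only 2 s processed: capped at the 3 s lookback.
print(bytes_to_resend(320000, 2000))  # prints: 96000
```

Slicing from an explicit start index avoids the case where a zero length turns a negative-index slice into a copy of the whole buffer.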