Chirp 3: Instant custom voice


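With Chirp 3 instant custom voice, you can create a personalized voice model by training a model on your own high-quality audio recordings. Instant custom voice generates personal voices quickly. You can then use the generated voice to synthesize audio through the Cloud Text-to-Speech API, which supports both streaming and long-form text.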

Technical details

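Available languages	See Available languages.
Available regions	`global`, `us`, `eu`, `asia-southeast1`, `asia-northeast1`, `europe-west2`
Supported output formats	`streaming`: `LINEAR16` (default), `ALAW`, `MULAW`, `OGG_OPUS`, `PCM`; `batch`: `LINEAR16` (default), `ALAW`, `MULAW`, `OGG_OPUS`, `PCM`
Supported encoding formats (input audio)	`LINEAR16`, `PCM`, `MP3`, `M4A`
Supported features	Text-based prompting: use punctuation, pauses, and disfluencies to add natural flow and pacing. Pause tags (experimental): add on-demand pauses to synthesized audio. Pace control: adjust the speed of synthesized audio from 0.25x to 2x. Pronunciation control (experimental): custom pronunciations of words or phrases using IPA or X-SAMPA phonetic encoding. Language transfer: voice cloning keys with the `en-US` locale can synthesize output in the `de-DE`, `es-US`, `es-ES`, `fr-CA`, `fr-FR`, and `pt-BR` locales.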

Available languages

Instant custom voice is supported in the following languages.

Language	BCP-47 code	Consent statement
Arabic (XA)	ar-XA	.أنا مالك هذا الصوت وأوافق على أن تستخدم Google هذا الصوت لإنشاء نموذج صوتي اصطناعي
Bengali (India)	bn-IN	আমি এই ভয়েসের মালিক এবং আমি একটি সিন্থেটিক ভয়েস মডেল তৈরি করতে এই ভয়েস ব্যবহার করে Google-এর সাথে সম্মতি দিচ্ছি।
Chinese (China)	cmn-CN	我是此声音的拥有者并授权谷歌使用此声音创建语音合成模型
English (Australia)	en-AU	I am the owner of this voice and I consent to Google using this voice to create a synthetic voice model.
English (India)	en-IN	I am the owner of this voice and I consent to Google using this voice to create a synthetic voice model.
English (UK)	en-GB	I am the owner of this voice and I consent to Google using this voice to create a synthetic voice model.
English (US)	en-US	I am the owner of this voice and I consent to Google using this voice to create a synthetic voice model.
French (Canada)	fr-CA	Je suis le propriétaire de cette voix et j'autorise Google à utiliser cette voix pour créer un modèle de voix synthétique.
French (France)	fr-FR	Je suis le propriétaire de cette voix et j'autorise Google à utiliser cette voix pour créer un modèle de voix synthétique.
German (Germany)	de-DE	Ich bin der Eigentümer dieser Stimme und bin damit einverstanden, dass Google diese Stimme zur Erstellung eines synthetischen Stimmmodells verwendet.
Gujarati (India)	gu-IN	હું આ વોઈસનો માલિક છું અને સિન્થેટિક વોઈસ મોડલ બનાવવા માટે આ વોઈસનો ઉપયોગ કરીને google ને હું સંમતિ આપું છું
Hindi (India)	hi-IN	मैं इस आवाज का मालिक हूं और मैं सिंथेटिक आवाज मॉडल बनाने के लिए Google को इस आवाज का उपयोग करने की सहमति देता हूं
Indonesian (Indonesia)	id-ID	Saya pemilik suara ini dan saya menyetujui Google menggunakan suara ini untuk membuat model suara sintetis.
Italian (Italy)	it-IT	Sono il proprietario di questa voce e acconsento che Google la utilizzi per creare un modello di voce sintetica.
Japanese (Japan)	ja-JP	私はこの音声の所有者であり、Googleがこの音声を使用して音声合成モデルを作成することを承認します。
Kannada (India)	kn-IN	ನಾನು ಈ ಧ್ವನಿಯ ಮಾಲಿಕ ಮತ್ತು ಸಂಶ್ಲೇಷಿತ ಧ್ವನಿ ಮಾದರಿಯನ್ನು ರಚಿಸಲು ಈ ಧ್ವನಿಯನ್ನು ಬಳಸಿಕೊಂಡುಗೂಗಲ್ ಗೆ ನಾನು ಸಮ್ಮತಿಸುತ್ತೇನೆ.
Korean (South Korea)	ko-KR	나는 이 음성의 소유자이며 구글이 이 음성을 사용하여 음성 합성 모델을 생성할 것을 허용합니다.
Malayalam (India)	ml-IN	ഈ ശബ്ദത്തിന്റെ ഉടമ ഞാനാണ്, ഒരു സിന്തറ്റിക് വോയ്‌സ് മോഡൽ സൃഷ്ടിക്കാൻ ഈ ശബ്‌ദം ഉപയോഗിക്കുന്നതിന് ഞാൻ Google-ന് സമ്മതം നൽകുന്നു.
Marathi (India)	mr-IN	मी या आवाजाचा मालक आहे आणि सिंथेटिक व्हॉइस मॉडेल तयार करण्यासाठी हा आवाज वापरण्यासाठी मी Google ला संमती देतो
Dutch (Netherlands)	nl-NL	Ik ben de eigenaar van deze stem en ik geef Google toestemming om deze stem te gebruiken om een synthetisch stemmodel te maken.
Polish (Poland)	pl-PL	Jestem właścicielem tego głosu i wyrażam zgodę na wykorzystanie go przez Google w celu utworzenia syntetycznego modelu głosu.
Portuguese (Brazil)	pt-BR	Eu sou o proprietário desta voz e autorizo o Google a usá-la para criar um modelo de voz sintética.
Russian (Russia)	ru-RU	Я являюсь владельцем этого голоса и даю согласие Google на использование этого голоса для создания модели синтетического голоса.
Tamil (India)	ta-IN	நான் இந்த குரலின் உரிமையாளர் மற்றும் செயற்கை குரல் மாதிரியை உருவாக்க இந்த குரலை பயன்படுத்த குகல்க்கு நான் ஒப்புக்கொள்கிறேன்.
Telugu (India)	te-IN	నేను ఈ వాయిస్ యజమానిని మరియు సింతటిక్ వాయిస్ మోడల్ ని రూపొందించడానికి ఈ వాయిస్ ని ఉపయోగించడానికి googleకి నేను సమ్మతిస్తున్నాను.
Thai (Thailand)	th-TH	ฉันเป็นเจ้าของเสียงนี้ และฉันยินยอมให้ Google ใช้เสียงนี้เพื่อสร้างแบบจำลองเสียงสังเคราะห์
Turkish (Turkey)	tr-TR	Bu sesin sahibi benim ve Google'ın bu sesi kullanarak sentetik bir ses modeli oluşturmasına izin veriyorum.
Vietnamese (Vietnam)	vi-VN	Tôi là chủ sở hữu giọng nói này và tôi đồng ý cho Google sử dụng giọng nói này để tạo mô hình giọng nói tổng hợp.
Spanish (Spain)	es-ES	Soy el propietario de esta voz y doy mi consentimiento para que Google la utilice para crear un modelo de voz sintética.
Spanish (US)	es-US	Soy el propietario de esta voz y doy mi consentimiento para que Google la utilice para crear un modelo de voz sintética.

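Use instant custom voice

The following sections show how to use Chirp 3: Instant custom voice with the Text-to-Speech API.

Record consent and reference audio

Record the consent statement: To comply with the legal and ethical guidelines for instant custom voice, record the required consent statement in the appropriate language as a single-channel audio file of at most 10 seconds, in a supported audio encoding. (For English: "I am the owner of this voice and I consent to Google using this voice to create a synthetic voice model.")
Record the reference audio: Using your computer microphone, record at most 10 seconds of audio as a single-channel audio file in a supported audio encoding. There should be no background noise during the recording. Record the consent audio and the reference audio in the same environment.
Store the audio files: Save the recorded audio files to a designated Cloud Storage location.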
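
The two recording constraints above (single channel, at most 10 seconds) can be checked locally before uploading. The following is a minimal sketch, not part of the product: it assumes the recording is a WAV file (LINEAR16 in a WAV container), uses only the Python standard library, and the names `check_recording` and `make_silent_wav` are hypothetical helpers introduced for illustration. It does not cover MP3 or M4A input.

```python
import io
import wave

MAX_DURATION_SECONDS = 10.0  # limit stated in the recording steps above


def check_recording(wav_bytes: bytes) -> list[str]:
    """Return a list of problems found in a WAV recording (empty if none)."""
    problems = []
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav_file:
        channels = wav_file.getnchannels()
        duration = wav_file.getnframes() / wav_file.getframerate()
    if channels != 1:
        problems.append(f"expected 1 channel, found {channels}")
    if duration > MAX_DURATION_SECONDS:
        problems.append(
            f"duration {duration:.1f}s exceeds {MAX_DURATION_SECONDS:.0f}s"
        )
    return problems


def make_silent_wav(seconds: float, channels: int = 1, rate: int = 24000) -> bytes:
    """Build a silent 16-bit WAV in memory, used here only to exercise the check."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav_file.setframerate(rate)
        wav_file.writeframes(b"\x00\x00" * channels * int(seconds * rate))
    return buffer.getvalue()


if __name__ == "__main__":
    print(check_recording(make_silent_wav(8.0)))               # mono, 8 s
    print(check_recording(make_silent_wav(12.0, channels=2)))  # stereo, 12 s
```

A check like this only catches format problems. It says nothing about background noise or microphone quality, which still need a listening pass.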

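Guidelines for producing high-quality reference and consent audio

Follow these guidelines to produce high-quality reference and consent audio:

Audio should be as close to 10 seconds long as possible.
Audio should include natural pauses and pacing.
Audio should have minimal background noise.
See the supported audio encodings for details. Any sample rate can be used.
The model replicates microphone quality, so if the recording sounds unclear, the output will sound unclear too.
The voice should be more dynamic and expressive than the voice you want in the final output. It should also have the prosody you want the cloned voice to have. For example, if the reference audio has no natural pauses or breaks, the cloned voice will not handle pauses well.
A good prompt sounds excited and energetic rather than monotone and bored, which gives the model cues to replicate that energy.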

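Create an instant custom voice using the REST API

An instant custom voice takes the form of a voice cloning key, which is a text string representation of your voice data.

Key points to keep in mind

Here are some important things to know when you create a custom voice:

Voice cloning keys are stored on the client side and supplied with each request, so there is no limit on the number of voice cloning keys you can create.
The same voice cloning key can be used by multiple clients or devices at the same time.
You can create 10 voice cloning keys per minute per project. For more information, see Request limits.
You can't use a custom consent script in place of the default script. You must use the consent statement script provided for the selected language.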
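
A client that creates many keys could keep itself under the limit of 10 key creations per minute with a small local limiter. The sketch below is an illustration under assumptions: `KeyCreationLimiter` is a hypothetical class, it uses a sliding 60-second window, and how the service itself accounts for the quota is not described here and may differ.

```python
import time
from collections import deque


class KeyCreationLimiter:
    """Sliding-window limiter: at most `max_calls` per `window_seconds`."""

    def __init__(self, max_calls: int = 10, window_seconds: float = 60.0,
                 clock=time.monotonic):
        self.max_calls = max_calls
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the limiter can be tested
        self._timestamps = deque()

    def try_acquire(self) -> bool:
        """Record a call and return True if it fits in the current window."""
        now = self.clock()
        # Drop calls that have left the window.
        while self._timestamps and now - self._timestamps[0] >= self.window_seconds:
            self._timestamps.popleft()
        if len(self._timestamps) >= self.max_calls:
            return False
        self._timestamps.append(now)
        return True


if __name__ == "__main__":
    limiter = KeyCreationLimiter()
    allowed = sum(limiter.try_acquire() for _ in range(12))
    print(f"{allowed} of 12 immediate requests allowed")
```

Call `try_acquire()` before each key-creation request and wait or queue the request when it returns `False`.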

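import json

import requests


def create_instant_custom_voice_key(
    access_token, project_id, reference_audio_bytes, consent_audio_bytes
):
    """Request a voice cloning key; returns None if the request fails.

    Despite their names, reference_audio_bytes and consent_audio_bytes must be
    base64-encoded strings, because they are sent in a JSON request body.
    """
    url = "https://texttospeech.googleapis.com/v1beta1/voices:generateVoiceCloningKey"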

    request_body = {
        "reference_audio": {
            # Supported audio_encoding values are LINEAR16, PCM, MP3, and M4A.
            "audio_config": {"audio_encoding": "LINEAR16"},
            "content": reference_audio_bytes,
        },
        "voice_talent_consent": {
            # Supported audio_encoding values are LINEAR16, PCM, MP3, and M4A.
            "audio_config": {"audio_encoding": "LINEAR16"},
            "content": consent_audio_bytes,
        },
        "consent_script": "I am the owner of this voice and I consent to Google using this voice to create a synthetic voice model.",
        "language_code": "en-US",
    }

    try:
        headers = {
            "Authorization": f"Bearer {access_token}",
            "x-goog-user-project": project_id,
            "Content-Type": "application/json; charset=utf-8",
        }

        response = requests.post(url, headers=headers, json=request_body)
        response.raise_for_status()

        response_json = response.json()
        return response_json.get("voiceCloningKey")

    except requests.exceptions.RequestException as e:
        print(f"Error making API request: {e}")
    except json.JSONDecodeError as e:
        print(f"Error decoding JSON response: {e}")
    except Exception as e:
        print(f"An unexpected error occurred: {e}")
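
The function above sends the audio in a JSON body, so the `content` values have to be base64 text rather than raw bytes. The following sketch shows one way to prepare the inputs and to persist the returned key in the `voice_cloning_key.txt` file that the client library samples read. The helper names `encode_audio_file` and `save_voice_cloning_key` and the audio file names are hypothetical.

```python
import base64
from pathlib import Path


def encode_audio_file(path: str) -> str:
    """Read an audio file and return its content as a base64 ASCII string."""
    return base64.b64encode(Path(path).read_bytes()).decode("ascii")


def save_voice_cloning_key(key: str, path: str = "voice_cloning_key.txt") -> None:
    """Write the key without surrounding whitespace so it reads back verbatim."""
    Path(path).write_text(key.strip(), encoding="utf-8")


# Example wiring (file names are placeholders, not values from this guide):
#
#   reference = encode_audio_file("reference.wav")
#   consent = encode_audio_file("consent.wav")
#   key = create_instant_custom_voice_key(
#       access_token, project_id, reference, consent
#   )
#   if key:
#       save_voice_cloning_key(key)
```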

Synthesize with an instant custom voice using the REST API

Use the voice cloning key to synthesize audio with the REST API.

import base64
import json

import requests
from IPython.display import Audio, display

def synthesize_text_with_cloned_voice(access_token, project_id, voice_key, text):
    url = "https://texttospeech.googleapis.com/v1beta1/text:synthesize"

    request_body = {
        "input": {
            "text": text
        },
        "voice": {
            "language_code": "en-US",
            "voice_clone": {
                "voice_cloning_key": voice_key,
            }
        },
        "audioConfig": {
            # Supported output encodings are LINEAR16, ALAW, MULAW, OGG_OPUS, and PCM.
            "audioEncoding": "LINEAR16",
        }
    }

    try:
        headers = {
            "Authorization": f"Bearer {access_token}",
            "x-goog-user-project": project_id,
            "Content-Type": "application/json; charset=utf-8"
        }

        response = requests.post(url, headers=headers, json=request_body)
        response.raise_for_status()

        response_json = response.json()
        audio_content = response_json.get("audioContent")

        if audio_content:
            display(Audio(base64.b64decode(audio_content), rate=24000))
        else:
            print("Error: Audio content not found in the response.")
            print(response_json)

    except requests.exceptions.RequestException as e:
        print(f"Error making API request: {e}")
    except json.JSONDecodeError as e:
        print(f"Error decoding JSON response: {e}")
    except Exception as e:
        print(f"An unexpected error occurred: {e}")
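
The sample above plays the result inline, which only works in a notebook. Outside a notebook, the base64 `audioContent` string could instead be decoded and written to disk. This is a minimal sketch with a hypothetical helper name, `save_audio_content`. It writes the decoded bytes unchanged, relying on `LINEAR16` output from the non-streaming API already carrying a WAV header.

```python
import base64


def save_audio_content(audio_content_b64: str, output_path: str) -> int:
    """Decode base64 audio from a synthesize response and write it to a file.

    Returns the number of bytes written.
    """
    audio_bytes = base64.b64decode(audio_content_b64)
    with open(output_path, "wb") as out:
        out.write(audio_bytes)
    return len(audio_bytes)


# Example (inside the function above, in place of the display(...) call):
#
#   save_audio_content(audio_content, "output.wav")
```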

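Synthesize with an instant custom voice using the Python client library

This example uses the Python client library to synthesize speech with an instant custom voice, using a voice cloning key stored in the file voice_cloning_key.txt. To generate a voice cloning key, see Create an instant custom voice using the REST API.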

from google.cloud import texttospeech
from google.cloud.texttospeech_v1beta1.services.text_to_speech import client


def perform_voice_cloning(
    voice_cloning_key: str,
    transcript: str,
    language_code: str,
    synthesis_output_path: str,
    tts_client: client.TextToSpeechClient,
) -> None:
  """Perform voice cloning and write output to a file.

  Args:
    voice_cloning_key: The voice cloning key.
    transcript: The transcript to synthesize.
    language_code: The language code.
    synthesis_output_path: The synthesis audio output path.
    tts_client: The TTS client to use.
  """
  voice_clone_params = texttospeech.VoiceCloneParams(
      voice_cloning_key=voice_cloning_key
  )
  voice = texttospeech.VoiceSelectionParams(
      language_code=language_code, voice_clone=voice_clone_params
  )
  request = texttospeech.SynthesizeSpeechRequest(
      input=texttospeech.SynthesisInput(text=transcript),
      voice=voice,
      audio_config=texttospeech.AudioConfig(
          audio_encoding=texttospeech.AudioEncoding.LINEAR16,
          sample_rate_hertz=24000,
      ),
  )
  response = tts_client.synthesize_speech(request)
  with open(synthesis_output_path, 'wb') as out:
    out.write(response.audio_content)
    print(f'Audio content written to file {synthesis_output_path}.')


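if __name__ == '__main__':
  # Named tts_client so it does not shadow the imported `client` module.
  tts_client = texttospeech.TextToSpeechClient()
  with open('voice_cloning_key.txt', 'r') as f:
    # Strip whitespace so a trailing newline in the file is not sent as part of the key.
    key = f.read().strip()
  perform_voice_cloning(
      voice_cloning_key=key,
      transcript='Hello world!',
      language_code='en-US',
      synthesis_output_path='/tmp/output.wav',
      tts_client=tts_client,
  )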

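Streaming synthesis with an instant custom voice using the Python client library

This example uses the Python client library to perform streaming synthesis with an instant custom voice, using a voice cloning key stored in voice_cloning_key.txt. To generate a voice cloning key, see Create an instant custom voice using the REST API.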

import io
import wave
from google.cloud import texttospeech
from google.cloud.texttospeech_v1beta1.services.text_to_speech import client


def perform_voice_cloning_with_simulated_streaming(
    voice_cloning_key: str,
    simulated_streamed_text: list[str],
    language_code: str,
    synthesis_output_path: str,
    tts_client: client.TextToSpeechClient,
) -> None:
  """Synthesize streamed text with a voice cloning key and write a WAV file.

  Args:
    voice_cloning_key: The voice cloning key.
    simulated_streamed_text: The list of transcripts to synthesize, where each
      item represents a chunk of streamed text. This is used to simulate
      streamed text input and is not meant to be representative of real-world
      streaming usage.
    language_code: The language code.
    synthesis_output_path: The path to write the synthesis audio output to.
    tts_client: The TTS client to use.
  """
  voice_clone_params = texttospeech.VoiceCloneParams(
      voice_cloning_key=voice_cloning_key
  )
  streaming_config = texttospeech.StreamingSynthesizeConfig(
      voice=texttospeech.VoiceSelectionParams(
          language_code=language_code, voice_clone=voice_clone_params
      ),
      streaming_audio_config=texttospeech.StreamingAudioConfig(
          audio_encoding=texttospeech.AudioEncoding.PCM,
          sample_rate_hertz=24000,
      ),
  )
  config_request = texttospeech.StreamingSynthesizeRequest(
      streaming_config=streaming_config
  )

  # Request generator. Consider using Gemini or another LLM with output
  # streaming as a generator.
  def request_generator():
    yield config_request
    for text in simulated_streamed_text:
      yield texttospeech.StreamingSynthesizeRequest(
          input=texttospeech.StreamingSynthesisInput(text=text)
      )

  streaming_responses = tts_client.streaming_synthesize(request_generator())
  audio_buffer = io.BytesIO()
  for response in streaming_responses:
    print(f'Audio content size in bytes is: {len(response.audio_content)}')
    audio_buffer.write(response.audio_content)

  # Write collected audio outputs to a WAV file.
  with wave.open(synthesis_output_path, 'wb') as wav_file:
    wav_file.setnchannels(1)
    wav_file.setsampwidth(2)
    wav_file.setframerate(24000)
    wav_file.writeframes(audio_buffer.getvalue())
    print(f'Audio content written to file {synthesis_output_path}.')


if __name__ == '__main__':
  # Named tts_client so it does not shadow the imported `client` module.
  tts_client = texttospeech.TextToSpeechClient()
  with open('voice_cloning_key.txt', 'r') as f:
    # Strip whitespace so a trailing newline in the file is not sent as part of the key.
    key = f.read().strip()
  perform_voice_cloning_with_simulated_streaming(
      voice_cloning_key=key,
      simulated_streamed_text=[
          'Hello world!',
          'This is the second text chunk.',
          'This simulates streaming text for synthesis.',
      ],
      language_code='en-US',
      synthesis_output_path='streaming_output.wav',
      tts_client=tts_client,
  )
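
The streaming sample takes its input as a ready-made list of text chunks. When the source is one long string, something has to produce those chunks. The sketch below is one possible approach, not a recommendation from the API: `split_into_chunks` is a hypothetical helper that splits on sentence-ending punctuation and packs sentences into chunks of roughly `max_chars` characters, a value chosen arbitrarily here.

```python
import re
from typing import Iterator


def split_into_chunks(text: str, max_chars: int = 200) -> Iterator[str]:
    """Yield sentence-aligned chunks of roughly `max_chars` characters.

    A single sentence longer than `max_chars` is yielded whole.
    """
    # Split after '.', '!' or '?' when followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunk = ""
    for sentence in sentences:
        if not sentence:
            continue
        if chunk and len(chunk) + 1 + len(sentence) > max_chars:
            yield chunk
            chunk = sentence
        else:
            chunk = f"{chunk} {sentence}" if chunk else sentence
    if chunk:
        yield chunk


if __name__ == "__main__":
    text = (
        "Hello world! This is the second text chunk. "
        "This simulates streaming text for synthesis."
    )
    for piece in split_into_chunks(text, max_chars=30):
        print(repr(piece))
```

The result can be passed as `simulated_streamed_text=list(split_into_chunks(long_text))`. The punctuation rule is tuned for English-like text and would need adjusting for languages that mark sentence boundaries differently.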

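Use Chirp 3: HD voice controls

Instant custom voice supports the same pace control, pause control, and custom pronunciation features that Chirp 3: HD voices support. For more information, see Chirp 3: HD voice controls.

All three features can be enabled for instant custom voice by adjusting the SynthesizeSpeechRequest or StreamingSynthesizeConfig in the same way as for Chirp 3: HD voices.

Languages supported for voice controls

Pace control is available in all languages.
Pause control is available in all languages.
Custom pronunciation is available in all languages except bn-IN, gu-IN, th-TH, and vi-VN.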
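
The pace range and the custom pronunciation exceptions can be encoded as simple client-side checks before a request is built. The sketch below is illustrative only: `build_audio_config` and `supports_custom_pronunciation` are hypothetical helpers, the 0.25 to 2.0 range comes from the technical details table, and `speakingRate` is the REST `audioConfig` field for pace.

```python
MIN_RATE, MAX_RATE = 0.25, 2.0  # pace range from the technical details table
NO_CUSTOM_PRONUNCIATION = {"bn-IN", "gu-IN", "th-TH", "vi-VN"}


def build_audio_config(speaking_rate: float = 1.0) -> dict:
    """Return a REST audioConfig dict, rejecting out-of-range pace values."""
    if not MIN_RATE <= speaking_rate <= MAX_RATE:
        raise ValueError(
            f"speaking_rate must be between {MIN_RATE} and {MAX_RATE}, "
            f"got {speaking_rate}"
        )
    return {"audioEncoding": "LINEAR16", "speakingRate": speaking_rate}


def supports_custom_pronunciation(language_code: str) -> bool:
    """True unless the locale is one of the listed exceptions.

    Assumes language_code is already one of the supported locales.
    """
    return language_code not in NO_CUSTOM_PRONUNCIATION


if __name__ == "__main__":
    print(build_audio_config(1.5))
    print(supports_custom_pronunciation("th-TH"))
```

The returned dict can replace the `audioConfig` value in the REST synthesis request shown earlier.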

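Enable multilingual transfer

Instant custom voice supports multilingual transfer for specified language pairs. This means that a voice cloning key created with one language code, such as en-US, can be used to synthesize speech in a different language, such as es-ES.

This code sample shows how to configure a SynthesizeSpeechRequest to synthesize es-ES using an en-US voice cloning key.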

voice_clone_params = texttospeech.VoiceCloneParams(
    voice_cloning_key=en_us_voice_cloning_key
)
request = texttospeech.SynthesizeSpeechRequest(
    input=texttospeech.SynthesisInput(text=transcript),
    voice=texttospeech.VoiceSelectionParams(
        language_code='es-ES', voice_clone=voice_clone_params
    ),
    audio_config=texttospeech.AudioConfig(
        audio_encoding=texttospeech.AudioEncoding.LINEAR16,
        sample_rate_hertz=24000,
    ),
)

Example of configuring a StreamingSynthesizeConfig to synthesize es-ES using an en-US voice cloning key:

voice_clone_params = texttospeech.VoiceCloneParams(
    voice_cloning_key=en_us_voice_cloning_key
)
streaming_config = texttospeech.StreamingSynthesizeConfig(
    voice=texttospeech.VoiceSelectionParams(
        language_code='es-ES', voice_clone=voice_clone_params
    ),
    streaming_audio_config=texttospeech.StreamingAudioConfig(
        audio_encoding=texttospeech.AudioEncoding.PCM,
        sample_rate_hertz=24000,
    ),
)
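
Before building either configuration above, a client could check that the requested output language is a valid transfer target for the key's language. This is a minimal sketch: `can_synthesize` is a hypothetical helper, and the mapping contains only the en-US pairs listed in the technical details table.

```python
# Transfer pairs listed in the technical details table: en-US keys only.
TRANSFER_TARGETS = {
    "en-US": {"de-DE", "es-US", "es-ES", "fr-CA", "fr-FR", "pt-BR"},
}


def can_synthesize(key_language: str, target_language: str) -> bool:
    """True if a key created for key_language may synthesize target_language."""
    if key_language == target_language:
        return True
    return target_language in TRANSFER_TARGETS.get(key_language, set())


if __name__ == "__main__":
    print(can_synthesize("en-US", "es-ES"))
    print(can_synthesize("es-ES", "en-US"))
```

Keeping the pairs in one table makes the check easy to update if the list of supported pairs changes.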