Gemini-TTS

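Gemini-TTS is the latest evolution of our Text-to-Speech technology. It goes beyond natural-sounding audio by giving you fine-grained control over the generated speech through text-based prompts. With Gemini-TTS, you can synthesize speech from short snippets to long-form narratives and precisely control style, accent, pace, tone, and even emotional expression, all steered through natural-language prompts.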

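The following models support Gemini-TTS:

  • gemini-2.5-flash-preview-tts: Gemini 2.5 Flash Preview, suited to cost-efficient everyday applications.

  • gemini-2.5-pro-preview-tts: Gemini 2.5 Pro Preview, suited to controllable speech generation (TTS), with high-quality handling of complex prompts.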

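Model | Optimized for | Input modality | Output modality | Single speaker
Gemini 2.5 Flash Preview TTS | Low-latency, controllable single-speaker and multi-speaker Text-to-Speech audio generation for cost-efficient everyday applications | Text | Audio | ✔️
Gemini 2.5 Pro Preview TTS | High control for structured workflows such as podcast generation, audiobooks, and customer support | Text | Audio | ✔️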

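Additional controls and capabilities include:

  1. Natural conversation: High-quality voice interactions with more appropriate expressiveness and prosody (rhythmic patterns), delivered at very low latency so that conversation flows smoothly.

  2. Style control: Using natural-language prompts, you can steer delivery toward a specific accent and produce a range of tones and expressions, including whispering.

  3. Dynamic performance: The models can bring text to life with expressive readings of poetry, newscasts, and engaging stories. On request, they can also perform with a specific emotion or accent.

  4. Enhanced pace and pronunciation control: Controlling delivery speed helps ensure more accurate pronunciation, including of specific words.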

Examples

model: "gemini-2.5-pro-preview-tts"
prompt: "You are having a casual conversation with a friend. Say the following in a friendly and amused way."
text: "hahah I did NOT expect that. Can you believe it!"
speaker: "Callirrhoe"

model: "gemini-2.5-flash-preview-tts"
prompt: "Say the following in a curious way"
text: "OK, so... tell me about this [uhm] AI thing."
speaker: "Orus"

model: "gemini-2.5-flash-preview-tts"
prompt: "Say the following"
text: "[extremely fast] Availability and terms may vary. Check our website or your local store for complete details and restrictions."
speaker: "Kore"
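
As a rough illustration, each example above can be mapped onto the JSON body used by the curl request later on this page. The helper name `build_request` is hypothetical and not part of any Google library. The field names mirror the curl example, and the `en-US` language code is an assumption based on it being the only language listed as supported.

```python
import json


def build_request(model: str, prompt: str, text: str, speaker: str) -> dict:
    """Map one example (model, prompt, text, speaker) onto a request body.

    Hypothetical helper: field names follow the curl example on this page.
    """
    return {
        "input": {"prompt": prompt, "text": text},
        "voice": {
            "languageCode": "en-US",  # assumed: the only listed language
            "name": speaker,
            "model_name": model,
        },
        # LINEAR16 is the documented default response format.
        "audioConfig": {"audioEncoding": "LINEAR16"},
    }


body = build_request(
    model="gemini-2.5-flash-preview-tts",
    prompt="Say the following in a curious way",
    text="OK, so... tell me about this [uhm] AI thing.",
    speaker="Orus",
)
print(json.dumps(body, indent=2))
```

Keeping the style instruction in `prompt` and the spoken words in `text` matches how the examples separate the two.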

To learn how to use these voices programmatically, see the Use Gemini-TTS section.

Voice options

Gemini-TTS offers a range of voice options similar to our existing Chirp 3: HD voices, each with distinct characteristics:

Name
Achernar
Achird
Algenib
Algieba
Alnilam
Aoede
Autonoe
Callirrhoe
Charon
Despina
Enceladus
Erinome
Fenrir
Gacrux
Iapetus
Kore
Laomedeia
Leda
Orus
Pulcherrima
Puck
Rasalgethi
Sadachbia
Sadaltager
Schedar
Sulafat
Umbriel
Vindemiatrix
Zephyr
Zubenelgenubi
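
Because a misspelled voice name is an easy mistake to make, a client-side check before sending a request may help. The following is a minimal sketch: `GEMINI_TTS_VOICES` and `normalize_voice` are hypothetical names, not part of the client library, and the case-insensitive matching is a convenience of this sketch rather than documented API behavior.

```python
# Voice names as listed on this page (hypothetical constant, not a library API).
GEMINI_TTS_VOICES = frozenset({
    "Achernar", "Achird", "Algenib", "Algieba", "Alnilam", "Aoede",
    "Autonoe", "Callirrhoe", "Charon", "Despina", "Enceladus", "Erinome",
    "Fenrir", "Gacrux", "Iapetus", "Kore", "Laomedeia", "Leda", "Orus",
    "Pulcherrima", "Puck", "Rasalgethi", "Sadachbia", "Sadaltager",
    "Schedar", "Sulafat", "Umbriel", "Vindemiatrix", "Zephyr",
    "Zubenelgenubi",
})

# Lowercase lookup table so that "kore" resolves to "Kore".
_VOICE_LOOKUP = {voice.lower(): voice for voice in GEMINI_TTS_VOICES}


def normalize_voice(name: str) -> str:
    """Return the listed spelling of a voice name, or raise ValueError."""
    key = name.strip().lower()
    if key not in _VOICE_LOOKUP:
        raise ValueError(f"Unknown Gemini-TTS voice: {name!r}")
    return _VOICE_LOOKUP[key]


print(normalize_voice("kore"))
```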

Supported languages

Gemini-TTS supports the following languages:

Language | BCP-47 code
English (United States) | en-US

Regional availability

Gemini-TTS models are available in the following Google Cloud regions:

Google Cloud region | Launch readiness
us | Public preview

Supported output formats

The default response format is LINEAR16. Other supported formats include:

API method | Format
batch | ALAW, MULAW, MP3, OGG_OPUS, and PCM
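
One way to apply the default-versus-override rule above is a small helper that falls back to LINEAR16 when no format is requested. This is a sketch under stated assumptions: `audio_config` is a hypothetical helper, the `audioEncoding` key is taken from the curl example on this page, and the helper does not distinguish between API methods.

```python
from typing import Optional

DEFAULT_ENCODING = "LINEAR16"
OTHER_ENCODINGS = ("ALAW", "MULAW", "MP3", "OGG_OPUS", "PCM")


def audio_config(encoding: Optional[str] = None) -> dict:
    """Build an audioConfig fragment, defaulting to LINEAR16.

    Hypothetical helper: it only checks the name against the formats
    listed on this page and does not call the API.
    """
    if encoding is None:
        return {"audioEncoding": DEFAULT_ENCODING}
    name = encoding.strip().upper()
    if name != DEFAULT_ENCODING and name not in OTHER_ENCODINGS:
        raise ValueError(f"Unsupported output format: {encoding!r}")
    return {"audioEncoding": name}


print(audio_config())       # {'audioEncoding': 'LINEAR16'}
print(audio_config("mp3"))  # {'audioEncoding': 'MP3'}
```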

Use Gemini-TTS

Learn how to use Gemini-TTS models to synthesize single-speaker speech.

Perform a synchronous speech synthesis request

Python

# google-cloud-texttospeech minimum version 2.29.0 is required.

import os
from google.cloud import texttospeech

PROJECT_ID = os.getenv("GOOGLE_CLOUD_PROJECT")

def synthesize(prompt: str, text: str, model_name: str, output_filepath: str = "output.mp3"):
   """Synthesizes speech from the input text and saves it to an MP3 file.

   Args:
       prompt: Styling instructions on how to synthesize the content in
         the text field.
       text: The text to synthesize.
       model_name: Gemini model to use. Currently, the available models are
         gemini-2.5-flash-preview-tts and gemini-2.5-pro-preview-tts
       output_filepath: The path to save the generated audio file.
         Defaults to "output.mp3".
   """
   client = texttospeech.TextToSpeechClient()

   synthesis_input = texttospeech.SynthesisInput(text=text, prompt=prompt)

   # Select the voice you want to use.
   voice = texttospeech.VoiceSelectionParams(
       language_code="en-US",
       name="Charon",  # Example voice, adjust as needed
       model_name=model_name
   )

   audio_config = texttospeech.AudioConfig(
       audio_encoding=texttospeech.AudioEncoding.MP3
   )

   # Perform the text-to-speech request on the text input with the selected
   # voice parameters and audio file type.
   response = client.synthesize_speech(
       input=synthesis_input, voice=voice, audio_config=audio_config
   )

   # The response's audio_content is binary.
   with open(output_filepath, "wb") as out:
       out.write(response.audio_content)
       print(f"Audio content written to file: {output_filepath}")

CURL

# Make sure to install gcloud cli, and sign in to your project.
# Make sure to use your PROJECT_ID value.
# Currently, the available models are gemini-2.5-flash-preview-tts and gemini-2.5-pro-preview-tts
# The last line of the command parses the JSON output and plays the audio directly.
# This requires jq and ffplay to be installed.
PROJECT_ID=YOUR_PROJECT_ID
curl -X POST \
-H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
-H "x-goog-user-project: $PROJECT_ID" \
-H "Content-Type: application/json" \
-d '{
"input": {
  "prompt": "Say the following in a curious way",
  "text": "OK, so... tell me about this [uhm] AI thing."
},
"voice": {
  "languageCode": "en-US",
  "name": "Kore",
  "model_name": "gemini-2.5-flash-preview-tts"
},
"audioConfig": {
  "audioEncoding": "LINEAR16"
}
}' \
"https://texttospeech.googleapis.com/v1/text:synthesize" \
| jq -r '.audioContent' | base64 -d | ffplay - -autoexit
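
The last line of the curl command extracts the `audioContent` field and base64-decodes it. The same step can be sketched in Python with only the standard library. `extract_audio` is a hypothetical helper, and the sample response below is a stand-in built for demonstration, not real API output.

```python
import base64
import json


def extract_audio(response_body: str) -> bytes:
    """Decode the base64 `audioContent` field of a synthesize JSON response.

    Mirrors `jq -r '.audioContent' | base64 -d` from the curl example.
    """
    payload = json.loads(response_body)
    if "audioContent" not in payload:
        raise ValueError("response has no 'audioContent' field")
    return base64.b64decode(payload["audioContent"])


# Stand-in response for demonstration only; a real response carries audio bytes.
fake_audio = b"not real audio"
fake_response = json.dumps(
    {"audioContent": base64.b64encode(fake_audio).decode("ascii")}
)

audio = extract_audio(fake_response)
print(len(audio), "bytes decoded")
```

Raising an error when the field is missing makes an API error response fail loudly rather than producing an empty audio file.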