
Multimodal Live API

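The Multimodal Live API enables low-latency, bidirectional voice and video interactions with Gemini. With it, you can give end users natural, human-like voice conversations, including the ability to interrupt the model's responses with voice commands. The model can process text, audio, and video input, and it can produce text and audio output.

The Multimodal Live API is available in the Gemini API as the BidiGenerateContent method and is built on WebSockets.

For more information, see the Multimodal Live API reference guide.

For a text-to-text example to help you get started with the Multimodal Live API, see the following: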

Gen AI SDK for Python

Install

pip install --upgrade google-genai

For more information, see the SDK reference documentation.

Set environment variables to use the Gen AI SDK with Vertex AI:

# Replace the `GOOGLE_CLOUD_PROJECT` and `GOOGLE_CLOUD_LOCATION` values
# with appropriate values for your project.
export GOOGLE_CLOUD_PROJECT=GOOGLE_CLOUD_PROJECT
export GOOGLE_CLOUD_LOCATION=us-central1
export GOOGLE_GENAI_USE_VERTEXAI=True
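
Before creating a client, it can help to confirm that the three variables exported above are actually visible to the Python process. The following is a minimal sketch, not part of the SDK: `REQUIRED_VARS` and `missing_vertex_env` are hypothetical names introduced here for illustration, and the check only tests that each variable is set and non-empty, not that its value is valid.

```python
import os

# Variables set by the export commands above (names taken from this page).
REQUIRED_VARS = (
    "GOOGLE_CLOUD_PROJECT",
    "GOOGLE_CLOUD_LOCATION",
    "GOOGLE_GENAI_USE_VERTEXAI",
)


def missing_vertex_env(env=None):
    """Return the required variable names that are unset or empty.

    Hypothetical helper: `env` defaults to os.environ, but any mapping
    can be passed in, which makes the check easy to test.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_vertex_env()
    if missing:
        print("Missing environment variables:", ", ".join(missing))
    else:
        print("All required environment variables are set.")
```

Passing a plain dictionary instead of relying on `os.environ` keeps the helper free of side effects, so it can be exercised without touching the real shell environment.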

import asyncio

from google import genai
from google.genai.types import LiveConnectConfig, HttpOptions, Modality

client = genai.Client(http_options=HttpOptions(api_version="v1beta1"))
model_id = "gemini-2.0-flash-exp"


async def main():
    # `async with` is only valid inside a coroutine, so the session runs in main().
    async with client.aio.live.connect(
        model=model_id,
        config=LiveConnectConfig(response_modalities=[Modality.TEXT]),
    ) as session:
        text_input = "Hello? Gemini, are you there?"
        print("> ", text_input, "\n")
        # Send one text turn and mark it complete so the model starts responding.
        await session.send(input=text_input, end_of_turn=True)

        response = []

        # Collect the streamed text chunks of the model's reply.
        async for message in session.receive():
            if message.text:
                response.append(message.text)

        print("".join(response))


asyncio.run(main())
# Example output:
# >  Hello? Gemini, are you there?
# Yes, I'm here. What would you like to talk about?
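
The receive loop in the sample above can be factored into a small reusable helper and tried out without a network connection. This is only a sketch under stated assumptions: `collect_text`, `fake_receive`, and `demo` are hypothetical names, and `fake_receive` is a stand-in that merely mimics an async stream of objects with a `.text` attribute; it is not the SDK's `session.receive()`.

```python
import asyncio
from types import SimpleNamespace


async def collect_text(messages):
    """Join the non-empty .text values from an async stream of messages."""
    parts = []
    async for message in messages:
        # Skip messages whose text is None or empty.
        if getattr(message, "text", None):
            parts.append(message.text)
    return "".join(parts)


async def fake_receive():
    """Hypothetical stand-in for a live session's message stream."""
    for chunk in ("Yes, ", None, "I'm here."):
        yield SimpleNamespace(text=chunk)


def demo():
    """Run the helper against the fake stream and return the joined text."""
    return asyncio.run(collect_text(fake_receive()))
```

With a real session, the same helper could presumably be called as `await collect_text(session.receive())` inside the `async with` block, assuming the stream yields objects with a `.text` attribute as in the sample.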

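Features:

Audio input with audio output
Audio and video input with audio output
A selection of voices; see Multimodal Live API voices
Session duration of up to 15 minutes for audio, or up to 2 minutes for audio and video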
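
The session duration limits listed above (15 minutes for audio, 2 minutes for audio and video) can be tracked on the client side. The following is a minimal sketch: the constants come from this page, while `max_session_length` and `remaining` are hypothetical helper names, and nothing here queries the API for the actual limit.

```python
from datetime import timedelta

# Limits as stated on this page; they may differ for other models or releases.
AUDIO_ONLY_LIMIT = timedelta(minutes=15)
AUDIO_VIDEO_LIMIT = timedelta(minutes=2)


def max_session_length(uses_video: bool) -> timedelta:
    """Return the documented session limit for the chosen input modalities."""
    return AUDIO_VIDEO_LIMIT if uses_video else AUDIO_ONLY_LIMIT


def remaining(elapsed: timedelta, uses_video: bool) -> timedelta:
    """Return the time left in the session, never less than zero."""
    left = max_session_length(uses_video) - elapsed
    return max(left, timedelta(0))
```

A client might call `remaining(...)` periodically and warn the user or reconnect shortly before the result reaches zero.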

To learn about other capabilities of the Multimodal Live API, see Multimodal Live API capabilities.

Languages:

English only

Limits:

See Multimodal Live API limits.
