The Multimodal Live API enables low-latency, bidirectional voice and video interactions with Gemini. With it, you can give end users natural, human-like voice conversations, including the ability to interrupt the model's responses with voice commands. The model can process text, audio, and video input, and it can produce text and audio output.
The Multimodal Live API is available in the Gemini API as the BidiGenerateContent method and is built on WebSockets. For more information, see the Multimodal Live API reference guide.
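Because the API is built on WebSockets, you can also call BidiGenerateContent directly, without the SDK. The following is a minimal sketch, assuming the v1beta WebSocket endpoint shown below, an API key in the GEMINI_API_KEY environment variable, and a setup message that names the model; treat the endpoint path and message shape as assumptions and confirm them against the reference guide.

import asyncio
import json
import os

import websockets  # pip install websockets

# Assumed endpoint; check the Multimodal Live API reference guide for the current path.
URI = (
    "wss://generativelanguage.googleapis.com/ws/"
    "google.ai.generativelanguage.v1beta.GenerativeService.BidiGenerateContent"
    f"?key={os.environ['GEMINI_API_KEY']}"
)

async def main():
    async with websockets.connect(URI) as ws:
        # The first client message of a session names the model to use (assumed shape).
        await ws.send(json.dumps({"setup": {"model": "models/gemini-2.0-flash-exp"}}))
        print(await ws.recv())  # expect a setup-complete acknowledgement

asyncio.run(main())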
To get started with the Multimodal Live API, try the following text-to-text example:
import asyncio

from google import genai

client = genai.Client(http_options={'api_version': 'v1beta'})
model_id = "gemini-2.0-flash-exp"
config = {"response_modalities": ["TEXT"]}

async def main():
    async with client.aio.live.connect(model=model_id, config=config) as session:
        message = "Hello? Gemini, are you there?"
        print("> ", message, "\n")
        await session.send(message, end_of_turn=True)

        # Stream the model's reply as it arrives.
        async for response in session.receive():
            print(response.text)

asyncio.run(main())
Features:
- Audio input with audio output (see the sketch after this list)
- Audio and video input with audio output
- A selection of voices; see Multimodal Live API voices
- Session duration of up to 15 minutes for audio or up to 2 minutes of audio and video
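As an illustration of the audio output path, here is a minimal sketch that requests the AUDIO response modality and writes the streamed audio to a WAV file. It assumes the response objects expose raw audio bytes as response.data in 16-bit PCM at 24 kHz; confirm the exact fields and audio format in the reference guide.

import asyncio
import wave

from google import genai

client = genai.Client(http_options={'api_version': 'v1beta'})
model_id = "gemini-2.0-flash-exp"
config = {"response_modalities": ["AUDIO"]}  # request spoken output instead of text

async def main():
    async with client.aio.live.connect(model=model_id, config=config) as session:
        # Collect the streamed audio into a WAV file (assumed 16-bit PCM, 24 kHz, mono).
        with wave.open("response.wav", "wb") as wf:
            wf.setnchannels(1)
            wf.setsampwidth(2)
            wf.setframerate(24000)

            await session.send("Hello? Gemini, are you there?", end_of_turn=True)

            async for response in session.receive():
                if response.data is not None:
                    wf.writeframes(response.data)

asyncio.run(main())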
To learn about additional capabilities of the Multimodal Live API, see Multimodal Live API capabilities.
Language:
- English only
Limitations: