The Multimodal Live API enables low-latency, bidirectional voice and video interactions with Gemini. Using the Multimodal Live API, you can provide end users with the experience of natural, human-like voice conversations, including the ability to interrupt the model's responses with voice commands. The model can process text, audio, and video input, and it can provide text and audio output.
The Multimodal Live API is available in the Gemini API as the `BidiGenerateContent` method and is built on WebSockets. For more information, see the Multimodal Live API reference guide.
For a text-to-text example to help you get started with the Multimodal Live API, see the following:
Gen AI SDK for Python

Learn how to install or update the Google Gen AI SDK for Python. For more information, see the Gen AI SDK for Python API reference documentation or the python-genai GitHub repository.
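The SDK is distributed on PyPI; assuming a working `pip`, a typical install or upgrade command looks like:

```shell
# Install or upgrade the Google Gen AI SDK for Python from PyPI.
pip install --upgrade google-genai
```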
Set environment variables to use the Gen AI SDK with Vertex AI:
```shell
# Replace the `GOOGLE_CLOUD_PROJECT` and `GOOGLE_CLOUD_LOCATION` values
# with appropriate values for your project.
export GOOGLE_CLOUD_PROJECT=GOOGLE_CLOUD_PROJECT
export GOOGLE_CLOUD_LOCATION=us-central1
export GOOGLE_GENAI_USE_VERTEXAI=True
```
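With those environment variables set, a text-to-text exchange over the Live API can be sketched as follows. This is a minimal sketch, assuming the google-genai SDK is installed and your project has access to a Live-API-capable model; the model ID below is an assumption and should be replaced with one available in your project.

```python
import asyncio

MODEL_ID = "gemini-2.0-flash-live-preview-04-09"  # assumed model name; substitute your own
CONFIG = {"response_modalities": ["TEXT"]}  # request text output only


async def main() -> None:
    # Imported here so the sketch can be read without the SDK installed.
    from google import genai

    client = genai.Client()  # picks up the GOOGLE_* environment variables
    async with client.aio.live.connect(model=MODEL_ID, config=CONFIG) as session:
        # Send one user turn and stream back the model's text response.
        await session.send_client_content(
            turns={"role": "user", "parts": [{"text": "Hello? Gemini, are you there?"}]}
        )
        async for message in session.receive():
            if message.text:
                print(message.text, end="")


if __name__ == "__main__":
    asyncio.run(main())
```

The `connect` call opens the underlying WebSocket session; closing the `async with` block ends the session.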
Features:
- Audio input with audio output
- Audio and video input with audio output
- A selection of voices; see Multimodal Live API voices
- Session duration of up to 15 minutes for audio or up to 2 minutes of audio and video
To learn about additional capabilities of the Multimodal Live API, see Multimodal Live API capabilities.
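To select one of the available voices, the session config accepts a speech configuration alongside the response modalities. A minimal sketch, assuming the google-genai SDK's dictionary-style config and that "Aoede" is one of the prebuilt voices listed in Multimodal Live API voices:

```python
# Session config requesting audio output with a specific prebuilt voice.
# "Aoede" is an assumption; consult the Multimodal Live API voices list
# for the voices actually available to your project.
AUDIO_CONFIG = {
    "response_modalities": ["AUDIO"],
    "speech_config": {
        "voice_config": {"prebuilt_voice_config": {"voice_name": "Aoede"}}
    },
}
```

This dictionary would be passed as the `config` argument when opening the session, in place of a text-only config.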
Language:
- English only
Limitations: