Live API

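With the Live API, you can hold low-latency, two-way voice and video interactions with Gemini. Use the Live API to give users natural, human-like voice conversations, including the ability to interrupt the model's responses with voice commands.

This document covers the basics of the Live API, including its capabilities, starter examples, and code examples for basic use cases. To learn how to start an interactive conversation with the Live API, see "Interactive conversations with the Live API". To learn about the tools the Live API can use, see "Built-in tools".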

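Supported models

Both the Google Gen AI SDK and Vertex AI Studio support the Live API. Some features, such as text input and output, are available only through the Gen AI SDK.

You can use the Live API with the following models: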

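Model version	Availability level
`gemini-live-2.5-flash`	Private early access^*
`gemini-live-2.5-flash-preview-native-audio-09-2025`	Public preview
`gemini-live-2.5-flash-preview-native-audio`	Public preview; discontinuation date: October 18, 2025

^* To request access, contact your Google account team representative.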

For more technical specifications and limits, see the Live API reference guide.

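Live API capabilities

Real-time multimodal understanding: With built-in support for audio and video streaming, talk with Gemini about what it sees in a video stream or on a shared screen.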
Built-in tool use: Combine function calling with the other tools described in "Built-in tools".
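Low-latency interaction: Hold low-latency conversations with Gemini that feel like talking with a real person.
Multilingual support: 24 languages are supported.
(GA versions only) Provisioned Throughput support: Use a fixed-cost, fixed-term subscription, available in several term lengths, to reserve throughput for supported generative AI models on Vertex AI, including the Live API.
High-quality transcription: The Live API supports text transcription of both input and output audio.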

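Gemini 2.5 Flash with the Live API also offers native audio in public preview. Native audio introduces the following capabilities:

Affective dialog: The Live API understands and responds to the user's tone of voice. The same words spoken in different ways can lead to very different, more nuanced conversations.
Proactive audio and context awareness: The Live API intelligently ignores ambient conversation and other irrelevant audio, recognizing when to listen and when to stay silent.

To learn more about native audio, see "Built-in tools".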

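Supported audio formats

The Live API supports the following audio formats:

Input audio: raw 16-bit PCM audio at 16 kHz, little-endian
Output audio: raw 16-bit PCM audio at 24 kHz, little-endian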
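
To make the format requirements concrete, the following sketch packs floating-point samples into raw 16-bit little-endian PCM and computes the duration of a PCM buffer. The helper names (`float_to_pcm16le`, `pcm16_duration_seconds`) are hypothetical and not part of any SDK, and the sketch assumes mono audio with samples already in the range -1.0 to 1.0.

```python
import struct

INPUT_SAMPLE_RATE_HZ = 16_000   # input rate listed above
OUTPUT_SAMPLE_RATE_HZ = 24_000  # output rate listed above
BYTES_PER_SAMPLE = 2            # 16-bit samples


def float_to_pcm16le(samples):
    """Pack float samples in [-1.0, 1.0] as raw 16-bit little-endian PCM (hypothetical helper)."""
    ints = []
    for sample in samples:
        clipped = max(-1.0, min(1.0, sample))  # guard against out-of-range input
        ints.append(int(round(clipped * 32767)))
    # "<" forces little-endian byte order; "h" is a signed 16-bit integer.
    return struct.pack(f"<{len(ints)}h", *ints)


def pcm16_duration_seconds(pcm_bytes, sample_rate_hz):
    """Duration in seconds of mono 16-bit PCM audio (mono is an assumption)."""
    return len(pcm_bytes) / (BYTES_PER_SAMPLE * sample_rate_hz)


pcm = float_to_pcm16le([0.0, 1.0, -1.0, 0.25])
print(pcm.hex())  # 0000ff7f01800020
print(pcm16_duration_seconds(bytes(32_000), INPUT_SAMPLE_RATE_HZ))  # 1.0
```

One second of input audio in this format is 32,000 bytes (16,000 samples at 2 bytes each); one second of output audio is 48,000 bytes.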

Supported video formats

The Live API accepts video frame input at 1 frame per second (FPS). For best results, use a native resolution of 768x768 at 1 FPS.
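
As a rough illustration of the 1 FPS guidance, the sketch below picks which frames of a higher-frame-rate source to keep, and computes a size that fits inside 768x768. Both helpers (`frames_to_send`, `fit_within`) are hypothetical, and preserving the aspect ratio while downscaling is an assumption of this sketch, not something this page specifies.

```python
import math


def frames_to_send(source_fps, total_frames, target_fps=1.0):
    """Indices of source frames to keep so about target_fps frames per second remain."""
    if source_fps <= 0 or target_fps <= 0:
        raise ValueError("frame rates must be positive")
    duration_seconds = total_frames / source_fps
    wanted = math.ceil(duration_seconds * target_fps)
    indices = [int(i * source_fps / target_fps) for i in range(wanted)]
    return [index for index in indices if index < total_frames]


def fit_within(width, height, max_side=768):
    """Scale (width, height) down to fit in a max_side square; never upscale."""
    scale = min(max_side / width, max_side / height, 1.0)
    return max(1, round(width * scale)), max(1, round(height * scale))


print(frames_to_send(source_fps=30, total_frames=90))  # [0, 30, 60]
print(fit_within(1920, 1080))                          # (768, 432)
```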

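Starter examples

You can get started with the Live API using one of the following notebook tutorials, demo applications, or guides.

Notebook tutorials

Download these notebook tutorials from GitHub, or open them in the environment of your choice.

Use WebSockets with the Live API

Streaming audio and video

Demo applications and guides

Additional examples

To get more out of the Live API, try the following examples, which use its audio processing, transcription, and voice response features.

Get text responses from audio input

You can send audio and receive text responses by first converting the audio to 16-bit PCM, 16 kHz, mono format. The following example reads a WAV file and sends it in the correct format: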

Python

# Test file: https://storage.googleapis.com/generativeai-downloads/data/16000.wav
# Install helpers for converting files: pip install librosa soundfile

import asyncio
import io
from pathlib import Path
from google import genai
from google.genai import types
import soundfile as sf
import librosa

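# GOOGLE_CLOUD_PROJECT and GOOGLE_CLOUD_LOCATION are placeholders: define them
# with your project ID and region before running.
client = genai.Client(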
    vertexai=True,
    project=GOOGLE_CLOUD_PROJECT,
    location=GOOGLE_CLOUD_LOCATION,
)
model = "gemini-live-2.5-flash"
config = {"response_modalities": ["TEXT"]}

async def main():
    async with client.aio.live.connect(model=model, config=config) as session:

        buffer = io.BytesIO()
        y, sr = librosa.load("sample.wav", sr=16000)
        sf.write(buffer, y, sr, format="RAW", subtype="PCM_16")
        buffer.seek(0)
        audio_bytes = buffer.read()

        # If already in correct format, you can use this:
        # audio_bytes = Path("sample.pcm").read_bytes()

        await session.send_realtime_input(
            audio=types.Blob(data=audio_bytes, mime_type="audio/pcm;rate=16000")
        )

        async for response in session.receive():
            if response.text is not None:
                print(response.text)

if __name__ == "__main__":
    asyncio.run(main())
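
For longer recordings, a client might prefer to send audio in short pieces rather than as one blob. This page does not specify a chunk size, so the 100 ms value below is an arbitrary choice for illustration, and `chunk_pcm` is a hypothetical helper rather than an SDK function. It assumes mono 16-bit PCM at 16 kHz, as in the example above.

```python
BYTES_PER_SAMPLE = 2  # 16-bit PCM


def chunk_pcm(audio_bytes, sample_rate_hz=16_000, chunk_ms=100):
    """Split mono 16-bit PCM bytes into pieces of about chunk_ms milliseconds each."""
    samples_per_chunk = sample_rate_hz * chunk_ms // 1000
    if samples_per_chunk <= 0:
        raise ValueError("chunk_ms is too small for this sample rate")
    # Chunk on whole samples so no 16-bit value is split across two pieces.
    bytes_per_chunk = samples_per_chunk * BYTES_PER_SAMPLE
    return [
        audio_bytes[start:start + bytes_per_chunk]
        for start in range(0, len(audio_bytes), bytes_per_chunk)
    ]


one_second = bytes(32_000)  # one second of silence at 16 kHz
chunks = chunk_pcm(one_second)
print(len(chunks), len(chunks[0]))  # 10 3200
```

Each piece could then be passed to `session.send_realtime_input` with the same `audio/pcm;rate=16000` MIME type used in the example above.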

Get voice responses from text input

Use this example to send text input and receive a synthesized speech response:

Python

import asyncio
import numpy as np
from IPython.display import Audio, Markdown, display
from google import genai
from google.genai.types import (
  Content,
  LiveConnectConfig,
  HttpOptions,
  Modality,
  Part,
  SpeechConfig,
  VoiceConfig,
  PrebuiltVoiceConfig,
)

client = genai.Client(
  vertexai=True,
  project=GOOGLE_CLOUD_PROJECT,
  location=GOOGLE_CLOUD_LOCATION,
)

voice_name = "Aoede"

config = LiveConnectConfig(
  response_modalities=["AUDIO"],
  speech_config=SpeechConfig(
      voice_config=VoiceConfig(
          prebuilt_voice_config=PrebuiltVoiceConfig(
              voice_name=voice_name,
          )
      ),
  ),
)

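# Top-level `async with` works in a notebook (IPython). In a script, wrap this
# block in an async function and run it with asyncio.run().
async with client.aio.live.connect(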
  model="gemini-live-2.5-flash",
  config=config,
) as session:
  text_input = "Hello? Gemini are you there?"
  display(Markdown(f"**Input:** {text_input}"))

  await session.send_client_content(
      turns=Content(role="user", parts=[Part(text=text_input)]))

  audio_data = []
  async for message in session.receive():
      if (
          message.server_content.model_turn
          and message.server_content.model_turn.parts
      ):
          for part in message.server_content.model_turn.parts:
              if part.inline_data:
                  audio_data.append(
                      np.frombuffer(part.inline_data.data, dtype=np.int16)
                  )

  if audio_data:
      display(Audio(np.concatenate(audio_data), rate=24000, autoplay=True))

For more examples of sending text, see the getting started guide.
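
Outside a notebook, the returned audio can be wrapped in a WAV container so that ordinary players can open it. The sketch below uses only Python's standard `wave` module; `pcm16_to_wav_bytes` is a hypothetical helper, and treating the output as mono is an assumption based on the notebook example above, which plays a one-dimensional sample array at 24 kHz.

```python
import io
import wave


def pcm16_to_wav_bytes(pcm_bytes, sample_rate_hz=24_000, channels=1):
    """Wrap raw 16-bit little-endian PCM in a WAV container and return the file bytes."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(channels)      # mono is an assumption
        wav_file.setsampwidth(2)             # 2 bytes per sample = 16-bit
        wav_file.setframerate(sample_rate_hz)
        wav_file.writeframes(pcm_bytes)
    return buffer.getvalue()


# One second of silence stands in for audio collected from the session.
wav_bytes = pcm16_to_wav_bytes(bytes(48_000))
with wave.open(io.BytesIO(wav_bytes), "rb") as check:
    print(check.getframerate(), check.getnframes())  # 24000 24000
```

With the example above, the argument would be the concatenation of each `part.inline_data.data` payload, and the result could be written to disk with `Path("reply.wav").write_bytes(...)`, where the filename is only illustrative.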

Transcribe audio

The Live API can transcribe both input and output audio. Use the following example to enable transcription:

Python

import asyncio
from google import genai
from google.genai import types

client = genai.Client(
    vertexai=True,
    project=GOOGLE_CLOUD_PROJECT,
    location=GOOGLE_CLOUD_LOCATION,
)
model = "gemini-live-2.5-flash"

config = {
    "response_modalities": ["AUDIO"],
    "input_audio_transcription": {},
    "output_audio_transcription": {}
}

async def main():
    async with client.aio.live.connect(model=model, config=config) as session:
        message = "Hello? Gemini are you there?"

        await session.send_client_content(
            turns={"role": "user", "parts": [{"text": message}]}, turn_complete=True
        )

        async for response in session.receive():
            if response.server_content.model_turn:
                print("Model turn:", response.server_content.model_turn)
            if response.server_content.input_transcription:
                print("Input transcript:", response.server_content.input_transcription.text)
            if response.server_content.output_transcription:
                print("Output transcript:", response.server_content.output_transcription.text)

if __name__ == "__main__":
    asyncio.run(main())

WebSocket

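# Assumes a notebook session (top-level await) in which SERVICE_URL and
# bearer_token are already defined, and a recent release of the websockets library.
import base64
import json

import numpy as np
from IPython.display import Audio, Markdown, display
from websockets.asyncio.client import connect

# Set model generation_config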
CONFIG = {
    'response_modalities': ['AUDIO'],
}

headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {bearer_token[0]}",
}

# Connect to the server
async with connect(SERVICE_URL, additional_headers=headers) as ws:
    # Setup the session
    await ws.send(
        json.dumps(
            {
                "setup": {
                    "model": "gemini-2.0-flash-live-preview-04-09",
                    "generation_config": CONFIG,
                    'input_audio_transcription': {},
                    'output_audio_transcription': {}
                }
            }
        )
    )

    # Receive setup response
    raw_response = await ws.recv(decode=False)
    setup_response = json.loads(raw_response.decode("ascii"))

    # Send text message
    text_input = "Hello? Gemini are you there?"
    display(Markdown(f"**Input:** {text_input}"))

    msg = {
        "client_content": {
            "turns": [{"role": "user", "parts": [{"text": text_input}]}],
            "turn_complete": True,
        }
    }

    await ws.send(json.dumps(msg))

    responses = []
    input_transcriptions = []
    output_transcriptions = []

    # Receive chunks of server response
    async for raw_response in ws:
        response = json.loads(raw_response.decode())
        server_content = response.pop("serverContent", None)
        if server_content is None:
            break

        if (input_transcription := server_content.get("inputTranscription")) is not None:
            if (text := input_transcription.get("text")) is not None:
                input_transcriptions.append(text)
        if (output_transcription := server_content.get("outputTranscription")) is not None:
            if (text := output_transcription.get("text")) is not None:
                output_transcriptions.append(text)

        model_turn = server_content.pop("modelTurn", None)
        if model_turn is not None:
            parts = model_turn.pop("parts", None)
            if parts is not None:
                for part in parts:
                    pcm_data = base64.b64decode(part["inlineData"]["data"])
                    responses.append(np.frombuffer(pcm_data, dtype=np.int16))

        # End of turn
        turn_complete = server_content.pop("turnComplete", None)
        if turn_complete:
            break

    if input_transcriptions:
        display(Markdown(f"**Input transcription >** {''.join(input_transcriptions)}"))

    if responses:
        # Play the returned audio message
        display(Audio(np.concatenate(responses), rate=24000, autoplay=True))

    if output_transcriptions:
        display(Markdown(f"**Output transcription >** {''.join(output_transcriptions)}"))
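
The transcript handling in the WebSocket example can also be factored into a small function that is easy to test without a connection. This is a sketch only: `collect_transcripts` is a hypothetical helper, the field names (`serverContent`, `inputTranscription`, `outputTranscription`, `turnComplete`) are taken from the example above, and the sample messages are made up for illustration.

```python
def collect_transcripts(messages):
    """Join transcription fragments from decoded server messages for one turn."""
    input_parts, output_parts = [], []
    for message in messages:
        server_content = message.get("serverContent")
        if server_content is None:
            continue  # skip messages that carry no server content
        for key, bucket in (
            ("inputTranscription", input_parts),
            ("outputTranscription", output_parts),
        ):
            text = (server_content.get(key) or {}).get("text")
            if text is not None:
                bucket.append(text)
        if server_content.get("turnComplete"):
            break  # stop at the end of the model's turn
    return "".join(input_parts), "".join(output_parts)


# Made-up messages shaped like the ones parsed in the example above.
sample_messages = [
    {"serverContent": {"outputTranscription": {"text": "Yes, "}}},
    {"serverContent": {"outputTranscription": {"text": "I'm here."}}},
    {"serverContent": {"turnComplete": True}},
]
print(collect_transcripts(sample_messages))  # ('', "Yes, I'm here.")
```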

Pricing for Live API transcription is based on the number of output text tokens. For details, see the Vertex AI pricing page.
