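Audio understanding (speech only)

You can add audio to Gemini requests to perform tasks that involve understanding the contents of the included audio. This page shows you how to add audio to your requests to Gemini in Vertex AI by using the Google Cloud console and the Vertex AI API.

Supported models

The following table lists the models that support audio understanding: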

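Model	Media details	MIME types
Gemini 2.5 Flash (Preview)	Maximum audio length per prompt: approximately 8.4 hours, or up to 1 million tokens; Maximum number of audio files per prompt: 1; Speech understanding for: audio summarization, transcription, and translation	`audio/x-aac` `audio/flac` `audio/mp3` `audio/m4a` `audio/mpeg` `audio/mpga` `audio/mp4` `audio/ogg` `audio/pcm` `audio/wav` `audio/webm`
Gemini 2.5 Flash-Lite (Preview)	Maximum audio length per prompt: approximately 8.4 hours, or up to 1 million tokens; Maximum number of audio files per prompt: 1	`audio/x-aac` `audio/flac` `audio/mp3` `audio/m4a` `audio/mpeg` `audio/mpga` `audio/mp4` `audio/ogg` `audio/pcm` `audio/wav` `audio/webm`
Gemini 2.5 Flash-Lite	Maximum audio length per prompt: approximately 8.4 hours, or up to 1 million tokens; Maximum number of audio files per prompt: 1	`audio/x-aac` `audio/flac` `audio/mp3` `audio/m4a` `audio/mpeg` `audio/mpga` `audio/mp4` `audio/ogg` `audio/pcm` `audio/wav` `audio/webm`
Gemini 2.5 Flash with Live API native audio (Preview)	Maximum conversation length: 10 minutes by default, can be extended; Required audio input format: raw 16-bit PCM audio at 16 kHz, little-endian; Required audio output format: raw 16-bit PCM audio at 24 kHz, little-endian	`audio/x-aac` `audio/flac` `audio/mp3` `audio/m4a` `audio/mpeg` `audio/mpga` `audio/mp4` `audio/ogg` `audio/pcm` `audio/wav` `audio/webm`
Gemini 2.0 Flash with Live API (Preview)	Maximum audio length per prompt: approximately 8.4 hours, or up to 1 million tokens; Maximum number of audio files per prompt: 1; Speech understanding for: audio summarization, transcription, and translation; Maximum tokens per minute (TPM): US/Asia: 1.7 M, EU: 0.4 M	`audio/x-aac` `audio/flac` `audio/mp3` `audio/m4a` `audio/mpeg` `audio/mpga` `audio/mp4` `audio/ogg` `audio/pcm` `audio/wav` `audio/webm`
Gemini 2.0 Flash with image generation (Preview)	Maximum audio length per prompt: approximately 8.4 hours, or up to 1 million tokens; Maximum number of audio files per prompt: 1; Speech understanding for: audio summarization, transcription, and translation; Maximum tokens per minute (TPM): US/Asia: 1.7 M, EU: 0.4 M
Gemini 2.5 Pro	Maximum audio length per prompt: approximately 8.4 hours, or up to 1 million tokens; Maximum number of audio files per prompt: 1; Speech understanding for: audio summarization, transcription, and translation	`audio/x-aac` `audio/flac` `audio/mp3` `audio/m4a` `audio/mpeg` `audio/mpga` `audio/mp4` `audio/ogg` `audio/pcm` `audio/wav` `audio/webm`
Gemini 2.5 Flash	Maximum audio length per prompt: approximately 8.4 hours, or up to 1 million tokens; Maximum number of audio files per prompt: 1; Speech understanding for: audio summarization, transcription, and translation	`audio/x-aac` `audio/flac` `audio/mp3` `audio/m4a` `audio/mpeg` `audio/mpga` `audio/mp4` `audio/ogg` `audio/pcm` `audio/wav` `audio/webm`
Gemini 2.0 Flash	Maximum audio length per prompt: approximately 8.4 hours, or up to 1 million tokens; Maximum number of audio files per prompt: 1; Speech understanding for: audio summarization, transcription, and translation; Maximum tokens per minute (TPM): US/Asia: 3.5 M, EU: 3.5 M	`audio/x-aac` `audio/flac` `audio/mp3` `audio/m4a` `audio/mpeg` `audio/mpga` `audio/mp4` `audio/ogg` `audio/pcm` `audio/wav` `audio/webm`
Gemini 2.0 Flash-Lite	Maximum audio length per prompt: approximately 8.4 hours, or up to 1 million tokens; Maximum number of audio files per prompt: 1; Speech understanding for: audio summarization, transcription, and translation; Maximum tokens per minute (TPM): US/Asia: 3.5 M, EU: 3.5 M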

The quota metric is generate_content_audio_input_per_base_model_id_and_resolution.
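
The two per-prompt limits in the table (about 8.4 hours of audio, or up to 1 million tokens) imply a rough token rate per second of audio. The following sketch derives that rate from the table figures and uses it to estimate whether a clip is likely to fit in one prompt. The helper names are hypothetical, and the derived rate is only an approximation, not an official accounting rule.

```python
MAX_AUDIO_HOURS = 8.4          # approximate per-prompt audio limit from the table
MAX_PROMPT_TOKENS = 1_000_000  # per-prompt token limit from the table

# Rate implied by the two limits above; an approximation only.
TOKENS_PER_SECOND = MAX_PROMPT_TOKENS / (MAX_AUDIO_HOURS * 3600)


def estimate_audio_tokens(duration_seconds: float) -> int:
    """Roughly estimate how many tokens an audio clip consumes."""
    if duration_seconds < 0:
        raise ValueError("duration_seconds must be non-negative")
    return round(duration_seconds * TOKENS_PER_SECOND)


def fits_in_prompt(duration_seconds: float) -> bool:
    """Return True if the estimated token count is within the prompt limit."""
    return estimate_audio_tokens(duration_seconds) <= MAX_PROMPT_TOKENS


if __name__ == "__main__":
    one_hour = 3600
    print(estimate_audio_tokens(one_hour))  # estimated tokens for a one-hour clip
    print(fits_in_prompt(9 * one_hour))     # a nine-hour clip exceeds the limit
```

For an exact figure, check the token count reported in the response's usageMetadata instead of relying on this estimate.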

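For a list of languages supported by Gemini models, see the model information for Google models. To learn more about how to design multimodal prompts, see Design multimodal prompts. If you're looking for a way to use Gemini directly from your mobile and web apps, see the Firebase AI Logic client SDKs for Swift, Android, web, Flutter, and Unity apps.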

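Add audio to a request

You can add audio files in your requests to Gemini.

Single audio

The following shows you how to use an audio file to summarize a podcast.

Console

To send a multimodal prompt by using the Google Cloud console, do the following:

In the Vertex AI section of the Google Cloud console, go to the Vertex AI Studio page.
Click Create prompt.
Optional: Configure the model and parameters:
- Model: Select a model.

Optional: To configure advanced parameters, click Advanced and configure as follows:

Top-K: Use the slider or textbox to enter a value for top-K.
Top-K changes how the model selects tokens for output. A top-K of 1 means the next selected token is the most probable among all tokens in the model's vocabulary (also called greedy decoding). A top-K of 3 means that the next token is selected from among the three most probable tokens by using temperature.
For each token selection step, the top-K tokens with the highest probabilities are sampled. Then tokens are further filtered based on top-P, with the final token selected using temperature sampling.

Specify a lower value for less random responses and a higher value for more random responses.
Top-P: Use the slider or textbox to enter a value for top-P. Tokens are selected from the most probable to the least probable until the sum of their probabilities equals the value of top-P. For the least variable results, set top-P to 0.
Max responses: Use the slider or textbox to enter a value for the number of responses to generate.
Streaming responses: Enable to display responses as they're generated.
Safety filter threshold: Select the threshold of how likely you are to see responses that could be harmful.
Enable Grounding: Grounding isn't supported for multimodal prompts.
Region: Select the region that you want to use.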

Temperature: Use the slider or textbox to enter a value for temperature.
The temperature is used for sampling during response generation, which occurs when topP and topK are applied. Temperature controls the degree of randomness in token selection. Lower temperatures are good for prompts that require a less open-ended or creative response, while higher temperatures can lead to more diverse or creative results. A temperature of 0 means that the highest probability tokens are always selected. In this case, responses for a given prompt are mostly deterministic, but a small amount of variation is still possible.

If the model returns a response that's too generic, too short, or the model gives a fallback response, try increasing the temperature.
Output token limit: Use the slider or textbox to enter a value for the max output limit.
Maximum number of tokens that can be generated in the response. A token is approximately four characters. 100 tokens correspond to roughly 60-80 words.

Specify a lower value for shorter responses and a higher value for potentially longer responses.
Add stop sequence: Optional. Enter a stop sequence, which is a series of characters that includes spaces. If the model encounters a stop sequence, the response generation stops. The stop sequence isn't included in the response, and you can add up to five stop sequences.
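
To make the interaction between top-K, top-P, and temperature more concrete, here is a minimal sketch of the selection order described in the advanced settings: keep the K most probable tokens, filter them by cumulative probability, then sample with temperature. This is an illustration only, not the model's actual decoding code; `sample_token` is a hypothetical helper and it assumes every probability is greater than zero.

```python
import math
import random


def sample_token(probs, top_k, top_p, temperature, rng=random):
    """Pick one token index from `probs` using top-K, top-P, then temperature.

    Assumes all entries of `probs` are positive.
    """
    # Rank token indices from most to least probable.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    # Top-K: keep only the K most probable tokens (at least one).
    candidates = ranked[:max(1, top_k)]

    # Top-P: keep tokens until their cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in candidates:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # A temperature of 0 always takes the most probable remaining token.
    if temperature == 0:
        return kept[0]

    # Otherwise rescale the log-probabilities by 1/temperature and sample.
    weights = [math.exp(math.log(probs[i]) / temperature) for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]


if __name__ == "__main__":
    probs = [0.1, 0.6, 0.3]
    # With top_k=1 only the most probable token survives (greedy decoding).
    print(sample_token(probs, top_k=1, top_p=1.0, temperature=1.0))  # prints 1
```

In this sketch, setting top_p to 0 has the same effect as top_k=1, which matches the advice to use a top-P of 0 for the least variable results.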

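Click Insert Media, and select a source for your file.
Upload
Select the file that you want to upload and click Open.

By URL
Enter the URL of the file that you want to use and click Insert.

Cloud Storage
Select the bucket and then the file from the bucket that you want to import and click Select.
Google Drive
1. Choose an account and give consent to Vertex AI Studio to access your account the first time you select this option. You can upload multiple files that have a total size of up to 10 MB. A single file can't exceed 7 MB.
2. Click the file that you want to add.
3. Click Select.

  The file thumbnail displays in the Prompt pane. The total number of tokens also displays. If your prompt data exceeds the token limit, the tokens are truncated and aren't included in processing your data.
Enter your text prompt in the Prompt pane.
Optional: To view the Token ID to text and Token IDs, click the tokens count in the Prompt pane.

Note: Media tokens aren't supported.
Click Submit.
Optional: To save your prompt to My prompts, click Save.
Optional: To get the Python code or a curl command for your prompt, click Build with code > Get code.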
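
The Google Drive option in the console steps has two size limits: 10 MB in total and 7 MB per file. A small pre-check like the following can catch oversized selections before you try to upload them. It is a sketch only: `check_drive_upload` is a hypothetical helper, and treating 1 MB as 1,000,000 bytes is an assumption because the page doesn't say which unit is meant.

```python
MB = 1_000_000  # assumption: decimal megabytes; the page doesn't specify the unit

MAX_TOTAL_BYTES = 10 * MB  # total size limit for a Google Drive upload
MAX_FILE_BYTES = 7 * MB    # size limit for any single file


def check_drive_upload(file_sizes):
    """Return a list of problems; an empty list means the sizes are within limits."""
    problems = []
    for index, size in enumerate(file_sizes):
        if size > MAX_FILE_BYTES:
            problems.append(f"file {index} is {size} bytes, over the 7 MB per-file limit")
    total = sum(file_sizes)
    if total > MAX_TOTAL_BYTES:
        problems.append(f"total is {total} bytes, over the 10 MB combined limit")
    return problems


if __name__ == "__main__":
    # Two files of 6 MB each pass the per-file check but exceed the total.
    for problem in check_drive_upload([6 * MB, 6 * MB]):
        print(problem)
```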

Python

Install

pip install --upgrade google-genai

To learn more, see the SDK reference documentation.

Set environment variables to use the Gen AI SDK with Vertex AI:

# Replace the `GOOGLE_CLOUD_PROJECT` and `GOOGLE_CLOUD_LOCATION` values
# with appropriate values for your project.
export GOOGLE_CLOUD_PROJECT=GOOGLE_CLOUD_PROJECT
export GOOGLE_CLOUD_LOCATION=global
export GOOGLE_GENAI_USE_VERTEXAI=True

from google import genai
from google.genai.types import HttpOptions, Part

client = genai.Client(http_options=HttpOptions(api_version="v1"))
prompt = """
Provide a concise summary of the main points in the audio file.
"""
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=[
        prompt,
        Part.from_uri(
            file_uri="gs://cloud-samples-data/generative-ai/audio/pixel.mp3",
            mime_type="audio/mpeg",
        ),
    ],
)
print(response.text)
# Example response:
# Here's a summary of the main points from the audio file:

# The Made by Google podcast discusses the Pixel feature drops with product managers Aisha Sheriff and De Carlos Love.  The key idea is that devices should improve over time, with a connected experience across phones, watches, earbuds, and tablets.

Go

Learn how to install or update Go.

To learn more, see the SDK reference documentation.

Set environment variables to use the Gen AI SDK with Vertex AI:

# Replace the `GOOGLE_CLOUD_PROJECT` and `GOOGLE_CLOUD_LOCATION` values
# with appropriate values for your project.
export GOOGLE_CLOUD_PROJECT=GOOGLE_CLOUD_PROJECT
export GOOGLE_CLOUD_LOCATION=global
export GOOGLE_GENAI_USE_VERTEXAI=True

import (
	"context"
	"fmt"
	"io"

	genai "google.golang.org/genai"
)

// generateWithAudio shows how to generate text using an audio input.
func generateWithAudio(w io.Writer) error {
	ctx := context.Background()

	client, err := genai.NewClient(ctx, &genai.ClientConfig{
		HTTPOptions: genai.HTTPOptions{APIVersion: "v1"},
	})
	if err != nil {
		return fmt.Errorf("failed to create genai client: %w", err)
	}

	modelName := "gemini-2.5-flash"
	contents := []*genai.Content{
		{Parts: []*genai.Part{
			{Text: `Provide the summary of the audio file.
Summarize the main points of the audio concisely.
Create a chapter breakdown with timestamps for key sections or topics discussed.`},
			{FileData: &genai.FileData{
				FileURI:  "gs://cloud-samples-data/generative-ai/audio/pixel.mp3",
				MIMEType: "audio/mpeg",
			}},
		},
			Role: "user"},
	}

	resp, err := client.Models.GenerateContent(ctx, modelName, contents, nil)
	if err != nil {
		return fmt.Errorf("failed to generate content: %w", err)
	}

	respText := resp.Text()

	fmt.Fprintln(w, respText)

	// Example response:
	// Here is a summary and chapter breakdown of the audio file:
	//
	// **Summary:**
	//
	// The audio file is a "Made by Google" podcast episode discussing the Pixel Feature Drops, ...
	//
	// **Chapter Breakdown:**
	//
	// *   **0:00 - 0:54:** Introduction to the podcast and guests, Aisha Sharif and DeCarlos Love.
	// ...

	return nil
}

Node.js

Install

npm install @google/genai

To learn more, see the SDK reference documentation.

Set environment variables to use the Gen AI SDK with Vertex AI:

# Replace the `GOOGLE_CLOUD_PROJECT` and `GOOGLE_CLOUD_LOCATION` values
# with appropriate values for your project.
export GOOGLE_CLOUD_PROJECT=GOOGLE_CLOUD_PROJECT
export GOOGLE_CLOUD_LOCATION=global
export GOOGLE_GENAI_USE_VERTEXAI=True

const {GoogleGenAI} = require('@google/genai');

const GOOGLE_CLOUD_PROJECT = process.env.GOOGLE_CLOUD_PROJECT;
const GOOGLE_CLOUD_LOCATION = process.env.GOOGLE_CLOUD_LOCATION || 'global';

async function generateText(
  projectId = GOOGLE_CLOUD_PROJECT,
  location = GOOGLE_CLOUD_LOCATION
) {
  const client = new GoogleGenAI({
    vertexai: true,
    project: projectId,
    location: location,
  });

  const prompt =
    'Provide a concise summary of the main points in the audio file.';

  const response = await client.models.generateContent({
    model: 'gemini-2.5-flash',
    contents: [
      {
        fileData: {
          fileUri: 'gs://cloud-samples-data/generative-ai/audio/pixel.mp3',
          mimeType: 'audio/mpeg',
        },
      },
      {text: prompt},
    ],
  });

  console.log(response.text);

  // Example response:
  //  Here's a summary of the main points from the audio file:
  //  The Made by Google podcast discusses the Pixel feature drops with product managers Aisha Sheriff and De Carlos Love.  The key idea is that devices should improve over time, with a connected experience across phones, watches, earbuds, and tablets.

  return response.text;
}

Java

Learn how to install or update Java.

To learn more, see the SDK reference documentation.

Set environment variables to use the Gen AI SDK with Vertex AI:

# Replace the `GOOGLE_CLOUD_PROJECT` and `GOOGLE_CLOUD_LOCATION` values
# with appropriate values for your project.
export GOOGLE_CLOUD_PROJECT=GOOGLE_CLOUD_PROJECT
export GOOGLE_CLOUD_LOCATION=global
export GOOGLE_GENAI_USE_VERTEXAI=True


import com.google.genai.Client;
import com.google.genai.types.Content;
import com.google.genai.types.GenerateContentResponse;
import com.google.genai.types.HttpOptions;
import com.google.genai.types.Part;

public class TextGenerationWithGcsAudio {

  public static void main(String[] args) {
    // TODO(developer): Replace these variables before running the sample.
    String modelId = "gemini-2.5-flash";
    generateContent(modelId);
  }

  // Generates text with audio input
  public static String generateContent(String modelId) {
    // Client Initialization. Once created, it can be reused for multiple requests.
    try (Client client =
        Client.builder()
            .location("global")
            .vertexAI(true)
            .httpOptions(HttpOptions.builder().apiVersion("v1").build())
            .build()) {

      GenerateContentResponse response =
          client.models.generateContent(
              modelId,
              Content.fromParts(
                  Part.fromUri(
                      "gs://cloud-samples-data/generative-ai/audio/pixel.mp3", "audio/mpeg"),
                  Part.fromText("Provide a concise summary of the main points in the audio file.")),
              null);

      System.out.print(response.text());
      // Example response:
      // The audio features Google product managers Aisha Sharif and D. Carlos Love discussing Pixel
      // Feature Drops, emphasizing their role in continually enhancing devices across the entire
      // Pixel ecosystem...
      return response.text();
    }
  }
}

REST

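After you set up your environment, you can use REST to test a text prompt. The following sample sends a request to the publisher model endpoint.

Before using any of the request data, make the following replacements:

PROJECT_ID: Your project ID.
FILE_URI: The URI or URL of the file to include in the prompt. Acceptable values include the following:
- Cloud Storage bucket URI: The object must either be publicly readable or reside in the same Google Cloud project that's sending the request. For gemini-2.0-flash and gemini-2.0-flash-lite, the size limit is 2 GB.
- HTTP URL: The file URL must be publicly readable. You can specify one video file, one audio file, and up to 10 image files per request. Audio files, video files, and documents can't exceed 15 MB.
- YouTube video URL: The YouTube video must be either owned by the account that you used to sign in to the Google Cloud console or be public. Only one YouTube video URL is supported per request.
When specifying a fileURI, you must also specify the media type (mimeType) of the file. If VPC Service Controls is enabled, specifying a media file URL for fileURI is not supported.

If you don't have an audio file in Cloud Storage, then you can use the following publicly available file: gs://cloud-samples-data/generative-ai/audio/pixel.mp3 with a MIME type of audio/mp3. To listen to this audio, open the sample MP3 file.
MIME_TYPE: The media type of the file specified in the data or fileUri fields. Acceptable values include the following:
- application/pdf
- audio/mpeg
- audio/mp3
- audio/wav
- image/png
- image/jpeg
- image/webp
- text/plain
- video/mov
- video/mpeg
- video/mp4
- video/mpg
- video/avi
- video/wmv
- video/mpegps
- video/flv
TEXT: The text instructions to include in the prompt. For example: Please provide a summary for the audio. Provide chapter titles, be concise and short, no need to provide chapter summaries. Do not make up any information that is not part of the audio and do not be verbose.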
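
As an alternative to editing the placeholders by hand, the request body can be assembled programmatically. The following sketch builds the same JSON structure that the REST samples in this section use (a file part followed by a text part). `build_request_body` is a hypothetical helper name, and the check for an empty MIME type only reflects the rule that a mimeType must accompany a file URI.

```python
import json


def build_request_body(file_uri: str, mime_type: str, text: str) -> dict:
    """Build a generateContent request body with one file part and one text part."""
    if not mime_type:
        # A media type must be specified whenever a file URI is specified.
        raise ValueError("mime_type is required when file_uri is set")
    return {
        "contents": {
            "role": "USER",
            "parts": [
                {"fileData": {"fileUri": file_uri, "mimeType": mime_type}},
                {"text": text},
            ],
        }
    }


if __name__ == "__main__":
    body = build_request_body(
        "gs://cloud-samples-data/generative-ai/audio/pixel.mp3",
        "audio/mp3",
        "Please provide a summary for the audio.",
    )
    # Write the body to request.json, the file name the REST samples expect.
    with open("request.json", "w", encoding="utf-8") as f:
        json.dump(body, f, indent=2)
    print(json.dumps(body, indent=2))
```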

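To send your request, choose one of these options:

curl

Note: The following command assumes that you have logged in to the gcloud CLI with your user account by running gcloud init or gcloud auth login, or by using Cloud Shell, which automatically logs you into the gcloud CLI. You can check the currently active account by running gcloud auth list.

Save the request body in a file named request.json. Run the following command in the terminal to create or overwrite this file in the current directory: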

cat > request.json << 'EOF'
{
  "contents": {
    "role": "USER",
    "parts": [
      {
        "fileData": {
          "fileUri": "FILE_URI",
          "mimeType": "MIME_TYPE"
        }
      },
      {
        "text": "TEXT"
      }
    ]
  }
}
EOF

Then execute the following command to send your REST request:

curl -X POST \
     -H "Authorization: Bearer $(gcloud auth print-access-token)" \
     -H "Content-Type: application/json; charset=utf-8" \
     -d @request.json \
     "https://aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/global/publishers/google/models/gemini-2.0-flash:generateContent"

PowerShell

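Note: The following command assumes that you have logged in to the gcloud CLI with your user account by running gcloud init or gcloud auth login. You can check the currently active account by running gcloud auth list.

Save the request body in a file named request.json. Run the following command in the terminal to create or overwrite this file in the current directory: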

@'
{
  "contents": {
    "role": "USER",
    "parts": [
      {
        "fileData": {
          "fileUri": "FILE_URI",
          "mimeType": "MIME_TYPE"
        }
      },
      {
        "text": "TEXT"
      }
    ]
  }
}
'@  | Out-File -FilePath request.json -Encoding utf8

Then execute the following command to send your REST request:

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
    -Method POST `
    -Headers $headers `
    -ContentType: "application/json; charset=utf-8" `
    -InFile request.json `
    -Uri "https://aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/global/publishers/google/models/gemini-2.0-flash:generateContent" | Select-Object -Expand Content

You should receive a JSON response similar to the following.

Response

{
  "candidates": [
    {
      "content": {
        "role": "model",
        "parts": [
          {
            "text": "## Made By Google Podcast - Pixel Feature Drops \n\n**Chapter 1: Transformative Pixel Features**\n\n**Chapter 2: Importance of Feature Drops**\n\n**Chapter 3: January's Feature Drop Highlights**\n\n**Chapter 4: March's Feature Drop Highlights for Pixel Watch**\n\n**Chapter 5: March's Feature Drop Highlights for Pixel Phones**\n\n**Chapter 6: Feature Drop Expansion to Other Devices**\n\n**Chapter 7: Deciding Which Features to Include in Feature Drops**\n\n**Chapter 8: Importance of User Feedback**\n\n**Chapter 9: When to Expect March's Feature Drop**\n\n**Chapter 10: Stand-Out Features from Past Feature Drops** \n"
          }
        ]
      },
      "finishReason": "STOP",
      "safetyRatings": [
        {
          "category": "HARM_CATEGORY_HATE_SPEECH",
          "probability": "NEGLIGIBLE",
          "probabilityScore": 0.05470151,
          "severity": "HARM_SEVERITY_NEGLIGIBLE",
          "severityScore": 0.07864238
        },
        {
          "category": "HARM_CATEGORY_DANGEROUS_CONTENT",
          "probability": "NEGLIGIBLE",
          "probabilityScore": 0.027742893,
          "severity": "HARM_SEVERITY_NEGLIGIBLE",
          "severityScore": 0.050051305
        },
        {
          "category": "HARM_CATEGORY_HARASSMENT",
          "probability": "NEGLIGIBLE",
          "probabilityScore": 0.08678674,
          "severity": "HARM_SEVERITY_NEGLIGIBLE",
          "severityScore": 0.06108711
        },
        {
          "category": "HARM_CATEGORY_SEXUALLY_EXPLICIT",
          "probability": "NEGLIGIBLE",
          "probabilityScore": 0.11899801,
          "severity": "HARM_SEVERITY_NEGLIGIBLE",
          "severityScore": 0.14706452
        }
      ]
    }
  ],
  "usageMetadata": {
    "promptTokenCount": 18883,
    "candidatesTokenCount": 150,
    "totalTokenCount": 19033
  }
}
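
Once the JSON response is parsed, the generated text and token counts can be read from the fields shown above. The following sketch assumes the response has the same shape as this sample (a `candidates` list whose first entry holds `content.parts`, plus `usageMetadata`); `extract_text` and `token_usage` are hypothetical helper names.

```python
def extract_text(response: dict) -> str:
    """Join the text of every part in the first candidate."""
    parts = response["candidates"][0]["content"]["parts"]
    return "".join(part.get("text", "") for part in parts)


def token_usage(response: dict) -> tuple:
    """Return (prompt, candidates, total) token counts, defaulting to 0."""
    usage = response.get("usageMetadata", {})
    return (
        usage.get("promptTokenCount", 0),
        usage.get("candidatesTokenCount", 0),
        usage.get("totalTokenCount", 0),
    )


if __name__ == "__main__":
    # A trimmed-down response with the same structure as the sample above.
    sample = {
        "candidates": [
            {
                "content": {"role": "model", "parts": [{"text": "## Made By Google Podcast"}]},
                "finishReason": "STOP",
            }
        ],
        "usageMetadata": {
            "promptTokenCount": 18883,
            "candidatesTokenCount": 150,
            "totalTokenCount": 19033,
        },
    }
    print(extract_text(sample))
    print(token_usage(sample))
```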

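Note the following in the URL for this sample:

Use the generateContent method to request that the response is returned after it's fully generated. To reduce the perception of latency to a human audience, stream the response as it's being generated by using the streamGenerateContent method.
The multimodal model ID is located at the end of the URL before the method (for example, gemini-2.0-flash). This sample might support other models as well.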
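
The URL structure described here (model ID at the end, followed by the method) can be captured in a small helper. This sketch only covers the global endpoint used in the samples on this page; `build_endpoint` is a hypothetical name, and other locations may need a different host.

```python
BASE_URL = "https://aiplatform.googleapis.com/v1"


def build_endpoint(project_id: str, model_id: str, stream: bool = False) -> str:
    """Build the publisher model URL for the global location."""
    # The method name follows the model ID, separated by a colon.
    method = "streamGenerateContent" if stream else "generateContent"
    return (
        f"{BASE_URL}/projects/{project_id}/locations/global"
        f"/publishers/google/models/{model_id}:{method}"
    )


if __name__ == "__main__":
    print(build_endpoint("PROJECT_ID", "gemini-2.0-flash"))
    print(build_endpoint("PROJECT_ID", "gemini-2.0-flash", stream=True))
```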

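Audio transcription

The following shows you how to use an audio file to transcribe an interview. To enable timestamp understanding for audio-only files, enable the audioTimestamp parameter in GenerationConfig.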

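Console

To send a multimodal prompt by using the Google Cloud console, do the following:

In the Vertex AI section of the Google Cloud console, go to the Vertex AI Studio page.
Click Create prompt.
Optional: Configure the model and parameters:
- Model: Select a model.

Optional: To configure advanced parameters, click Advanced and configure as follows:

Top-K: Use the slider or textbox to enter a value for top-K.
Top-K changes how the model selects tokens for output. A top-K of 1 means the next selected token is the most probable among all tokens in the model's vocabulary (also called greedy decoding). A top-K of 3 means that the next token is selected from among the three most probable tokens by using temperature.
For each token selection step, the top-K tokens with the highest probabilities are sampled. Then tokens are further filtered based on top-P, with the final token selected using temperature sampling.

Specify a lower value for less random responses and a higher value for more random responses.
Top-P: Use the slider or textbox to enter a value for top-P. Tokens are selected from the most probable to the least probable until the sum of their probabilities equals the value of top-P. For the least variable results, set top-P to 0.
Max responses: Use the slider or textbox to enter a value for the number of responses to generate.
Streaming responses: Enable to display responses as they're generated.
Safety filter threshold: Select the threshold of how likely you are to see responses that could be harmful.
Enable Grounding: Grounding isn't supported for multimodal prompts.
Region: Select the region that you want to use.

Temperature: Use the slider or textbox to enter a value for temperature.
The temperature is used for sampling during response generation, which occurs when topP and topK are applied. Temperature controls the degree of randomness in token selection. Lower temperatures are good for prompts that require a less open-ended or creative response, while higher temperatures can lead to more diverse or creative results. A temperature of 0 means that the highest probability tokens are always selected. In this case, responses for a given prompt are mostly deterministic, but a small amount of variation is still possible.

If the model returns a response that's too generic, too short, or the model gives a fallback response, try increasing the temperature.
Output token limit: Use the slider or textbox to enter a value for the max output limit.
Maximum number of tokens that can be generated in the response. A token is approximately four characters. 100 tokens correspond to roughly 60-80 words.

Specify a lower value for shorter responses and a higher value for potentially longer responses.
Add stop sequence: Optional. Enter a stop sequence, which is a series of characters that includes spaces. If the model encounters a stop sequence, the response generation stops. The stop sequence isn't included in the response, and you can add up to five stop sequences.

Click Insert Media, and select a source for your file.
Upload
Select the file that you want to upload and click Open.

By URL
Enter the URL of the file that you want to use and click Insert.

Cloud Storage
Select the bucket and then the file from the bucket that you want to import and click Select.
Google Drive
1. Choose an account and give consent to Vertex AI Studio to access your account the first time you select this option. You can upload multiple files that have a total size of up to 10 MB. A single file can't exceed 7 MB.
2. Click the file that you want to add.
3. Click Select.

  The file thumbnail displays in the Prompt pane. The total number of tokens also displays. If your prompt data exceeds the token limit, the tokens are truncated and aren't included in processing your data.
Enter your text prompt in the Prompt pane.
Optional: To view the Token ID to text and Token IDs, click the tokens count in the Prompt pane.

Note: Media tokens aren't supported.
Click Submit.
Optional: To save your prompt to My prompts, click Save.
Optional: To get the Python code or a curl command for your prompt, click Build with code > Get code.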

Python

Install

pip install --upgrade google-genai

To learn more, see the SDK reference documentation.

Set environment variables to use the Gen AI SDK with Vertex AI:

# Replace the `GOOGLE_CLOUD_PROJECT` and `GOOGLE_CLOUD_LOCATION` values
# with appropriate values for your project.
export GOOGLE_CLOUD_PROJECT=GOOGLE_CLOUD_PROJECT
export GOOGLE_CLOUD_LOCATION=global
export GOOGLE_GENAI_USE_VERTEXAI=True

from google import genai
from google.genai.types import GenerateContentConfig, HttpOptions, Part

client = genai.Client(http_options=HttpOptions(api_version="v1"))
prompt = """
Transcribe the interview, in the format of timecode, speaker, caption.
Use speaker A, speaker B, etc. to identify speakers.
"""
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=[
        prompt,
        Part.from_uri(
            file_uri="gs://cloud-samples-data/generative-ai/audio/pixel.mp3",
            mime_type="audio/mpeg",
        ),
    ],
    # Required to enable timestamp understanding for audio-only files
    config=GenerateContentConfig(audio_timestamp=True),
)
print(response.text)
# Example response:
# [00:00:00] **Speaker A:** your devices are getting better over time. And so ...
# [00:00:14] **Speaker B:** Welcome to the Made by Google podcast where we meet ...
# [00:00:20] **Speaker B:** Here's your host, Rasheed Finch.
# [00:00:23] **Speaker C:** Today we're talking to Aisha Sharif and DeCarlos Love. ...
# ...

Go

Learn how to install or update Go.

To learn more, see the SDK reference documentation.

Set environment variables to use the Gen AI SDK with Vertex AI:

# Replace the `GOOGLE_CLOUD_PROJECT` and `GOOGLE_CLOUD_LOCATION` values
# with appropriate values for your project.
export GOOGLE_CLOUD_PROJECT=GOOGLE_CLOUD_PROJECT
export GOOGLE_CLOUD_LOCATION=global
export GOOGLE_GENAI_USE_VERTEXAI=True

import (
	"context"
	"fmt"
	"io"

	genai "google.golang.org/genai"
)

// generateAudioTranscript shows how to generate an audio transcript.
func generateAudioTranscript(w io.Writer) error {
	ctx := context.Background()

	client, err := genai.NewClient(ctx, &genai.ClientConfig{
		HTTPOptions: genai.HTTPOptions{APIVersion: "v1"},
	})
	if err != nil {
		return fmt.Errorf("failed to create genai client: %w", err)
	}

	modelName := "gemini-2.5-flash"
	contents := []*genai.Content{
		{Parts: []*genai.Part{
			{Text: `Transcribe the interview, in the format of timecode, speaker, caption.
Use speaker A, speaker B, etc. to identify speakers.`},
			{FileData: &genai.FileData{
				FileURI:  "gs://cloud-samples-data/generative-ai/audio/pixel.mp3",
				MIMEType: "audio/mpeg",
			}},
		},
			Role: "user"},
	}

	resp, err := client.Models.GenerateContent(ctx, modelName, contents, nil)
	if err != nil {
		return fmt.Errorf("failed to generate content: %w", err)
	}

	respText := resp.Text()

	fmt.Fprintln(w, respText)

	// Example response:
	// 00:00:00, A: your devices are getting better over time.
	// 00:01:13, A: And so we think about it across the entire portfolio from phones to watch, ...
	// ...

	return nil
}

Node.js

Install

npm install @google/genai

To learn more, see the SDK reference documentation.

Set environment variables to use the Gen AI SDK with Vertex AI:

# Replace the `GOOGLE_CLOUD_PROJECT` and `GOOGLE_CLOUD_LOCATION` values
# with appropriate values for your project.
export GOOGLE_CLOUD_PROJECT=GOOGLE_CLOUD_PROJECT
export GOOGLE_CLOUD_LOCATION=global
export GOOGLE_GENAI_USE_VERTEXAI=True

const {GoogleGenAI} = require('@google/genai');

const GOOGLE_CLOUD_PROJECT = process.env.GOOGLE_CLOUD_PROJECT;
const GOOGLE_CLOUD_LOCATION = process.env.GOOGLE_CLOUD_LOCATION || 'global';

async function generateText(
  projectId = GOOGLE_CLOUD_PROJECT,
  location = GOOGLE_CLOUD_LOCATION
) {
  const client = new GoogleGenAI({
    vertexai: true,
    project: projectId,
    location: location,
  });

  const prompt = `Transcribe the interview, in the format of timecode, speaker, caption.
    Use speaker A, speaker B, etc. to identify speakers.`;

  const response = await client.models.generateContent({
    model: 'gemini-2.5-flash',
    contents: [
      {text: prompt},
      {
        fileData: {
          fileUri: 'gs://cloud-samples-data/generative-ai/audio/pixel.mp3',
          mimeType: 'audio/mpeg',
        },
      },
    ],
    // Required to enable timestamp understanding for audio-only files
    config: {
      audioTimestamp: true,
    },
  });

  console.log(response.text);

  // Example response:
  // [00:00:00] **Speaker A:** your devices are getting better over time. And so ...
  // [00:00:14] **Speaker B:** Welcome to the Made by Google podcast where we meet ...
  // [00:00:20] **Speaker B:** Here's your host, Rasheed Finch.
  // [00:00:23] **Speaker C:** Today we're talking to Aisha Sharif and DeCarlos Love. ...
  // ...

  return response.text;
}

Java

Learn how to install or update Java.

To learn more, see the SDK reference documentation.

Set environment variables to use the Gen AI SDK with Vertex AI:

# Replace the `GOOGLE_CLOUD_PROJECT` and `GOOGLE_CLOUD_LOCATION` values
# with appropriate values for your project.
export GOOGLE_CLOUD_PROJECT=GOOGLE_CLOUD_PROJECT
export GOOGLE_CLOUD_LOCATION=global
export GOOGLE_GENAI_USE_VERTEXAI=True


import com.google.genai.Client;
import com.google.genai.types.Content;
import com.google.genai.types.GenerateContentConfig;
import com.google.genai.types.GenerateContentResponse;
import com.google.genai.types.HttpOptions;
import com.google.genai.types.Part;

public class TextGenerationTranscriptWithGcsAudio {

  public static void main(String[] args) {
    // TODO(developer): Replace these variables before running the sample.
    String modelId = "gemini-2.5-flash";
    generateContent(modelId);
  }

  // Generates transcript with audio input
  public static String generateContent(String modelId) {
    // Client Initialization. Once created, it can be reused for multiple requests.
    try (Client client =
        Client.builder()
            .location("global")
            .vertexAI(true)
            .httpOptions(HttpOptions.builder().apiVersion("v1").build())
            .build()) {

      String prompt =
          "Transcribe the interview, in the format of timecode, speaker, caption.\n"
              + "Use speaker A, speaker B, etc. to identify speakers.";

      // Enable audioTimestamp to generate timestamps for audio-only files.
      GenerateContentConfig contentConfig =
          GenerateContentConfig.builder().audioTimestamp(true).build();

      GenerateContentResponse response =
          client.models.generateContent(
              modelId,
              Content.fromParts(
                  Part.fromUri(
                      "gs://cloud-samples-data/generative-ai/audio/pixel.mp3", "audio/mpeg"),
                  Part.fromText(prompt)),
              contentConfig);

      System.out.print(response.text());
      // Example response:
      // 00:00 - Speaker A: your devices are getting better over time. And so we think about it...
      // 00:14 - Speaker B: Welcome to the Made by Google Podcast, where we meet the people who...
      // 00:41 - Speaker A: So many features. I am a singer, so I actually think recorder...
      return response.text();
    }
  }
}

REST

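After you set up your environment, you can use REST to test a text prompt. The following sample sends a request to the publisher model endpoint.

Before using any of the request data, make the following replacements:

PROJECT_ID: Your project ID.
FILE_URI: The URI or URL of the file to include in the prompt. Acceptable values include the following:
- Cloud Storage bucket URI: The object must either be publicly readable or reside in the same Google Cloud project that's sending the request. For gemini-2.0-flash and gemini-2.0-flash-lite, the size limit is 2 GB.
- HTTP URL: The file URL must be publicly readable. You can specify one video file, one audio file, and up to 10 image files per request. Audio files, video files, and documents can't exceed 15 MB.
- YouTube video URL: The YouTube video must be either owned by the account that you used to sign in to the Google Cloud console or be public. Only one YouTube video URL is supported per request.
When specifying a fileURI, you must also specify the media type (mimeType) of the file. If VPC Service Controls is enabled, specifying a media file URL for fileURI is not supported.

If you don't have an audio file in Cloud Storage, then you can use the following publicly available file: gs://cloud-samples-data/generative-ai/audio/pixel.mp3 with a MIME type of audio/mp3. To listen to this audio, open the sample MP3 file.
MIME_TYPE: The media type of the file specified in the data or fileUri fields. Acceptable values include the following:
- application/pdf
- audio/mpeg
- audio/mp3
- audio/wav
- image/png
- image/jpeg
- image/webp
- text/plain
- video/mov
- video/mpeg
- video/mp4
- video/mpg
- video/avi
- video/wmv
- video/mpegps
- video/flv
TEXT: The text instructions to include in the prompt. For example: Can you transcribe this interview, in the format of timecode, speaker, caption. Use speaker A, speaker B, etc. to identify speakers.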

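To send your request, choose one of these options:

curl

Save the request body in a file named request.json. Run the following command in the terminal to create or overwrite this file in the current directory: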

cat > request.json << 'EOF'
{
  "contents": {
    "role": "USER",
    "parts": [
      {
        "fileData": {
          "fileUri": "FILE_URI",
          "mimeType": "MIME_TYPE"
        }
      },
      {
        "text": "TEXT"
      }
    ]
  },
  "generationConfig": {
    "audioTimestamp": true
  }
}
EOF

Then execute the following command to send your REST request:

curl -X POST \
     -H "Authorization: Bearer $(gcloud auth print-access-token)" \
     -H "Content-Type: application/json; charset=utf-8" \
     -d @request.json \
     "https://aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/global/publishers/google/models/gemini-2.0-flash:generateContent"

PowerShell

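Note: The following command assumes that you have logged in to the gcloud CLI with your user account by running gcloud init or gcloud auth login. You can check the currently active account by running gcloud auth list.

Save the request body in a file named request.json. Run the following command in the terminal to create or overwrite this file in the current directory: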

@'
{
  "contents": {
    "role": "USER",
    "parts": [
      {
        "fileData": {
          "fileUri": "FILE_URI",
          "mimeType": "MIME_TYPE"
        }
      },
      {
        "text": "TEXT"
      }
    ]
  },
  "generationConfig": {
    "audioTimestamp": true
  }
}
'@  | Out-File -FilePath request.json -Encoding utf8

Then execute the following command to send your REST request:

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
    -Method POST `
    -Headers $headers `
    -ContentType: "application/json; charset=utf-8" `
    -InFile request.json `
    -Uri "https://aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/global/publishers/google/models/gemini-2.0-flash:generateContent" | Select-Object -Expand Content

You should receive a JSON response similar to the following.

Response

{
  "candidates": [
    {
      "content": {
        "role": "model",
        "parts": [
          {
            "text": "0:00 Speaker A: Your devices are getting better over time, and so we think
              about it across the entire portfolio from phones to watch to buds to tablet. We get
              really excited about how we can tell a joint narrative across everything.
              0:18 Speaker B: Welcome to the Made By Google Podcast, where we meet the people who
              work on the Google products you love. Here's your host, Rasheed.
              0:33 Speaker B: Today we're talking to Aisha and DeCarlos. They're both
              Product Managers for various Pixel devices and work on something that all the Pixel
              owners love. The Pixel feature drops. This is the Made By Google Podcast. Aisha, which
              feature on your Pixel phone has been most transformative in your own life?
              0:56 Speaker A: So many features. I am a singer, so I actually think recorder
              transcription has been incredible because before I would record songs I'd just like,
              freestyle them, record them, type them up. But now with transcription it works so well
              even deciphering lyrics that are jumbled. I think that's huge.
              ...
              Subscribe now wherever you get your podcasts to be the first to listen."
          }
        ]
      },
      "finishReason": "STOP",
      "safetyRatings": [
        {
          "category": "HARM_CATEGORY_HATE_SPEECH",
          "probability": "NEGLIGIBLE",
          "probabilityScore": 0.043609526,
          "severity": "HARM_SEVERITY_NEGLIGIBLE",
          "severityScore": 0.06255973
        },
        {
          "category": "HARM_CATEGORY_DANGEROUS_CONTENT",
          "probability": "NEGLIGIBLE",
          "probabilityScore": 0.022328783,
          "severity": "HARM_SEVERITY_NEGLIGIBLE",
          "severityScore": 0.04426588
        },
        {
          "category": "HARM_CATEGORY_HARASSMENT",
          "probability": "NEGLIGIBLE",
          "probabilityScore": 0.07107367,
          "severity": "HARM_SEVERITY_NEGLIGIBLE",
          "severityScore": 0.049405243
        },
        {
          "category": "HARM_CATEGORY_SEXUALLY_EXPLICIT",
          "probability": "NEGLIGIBLE",
          "probabilityScore": 0.10484337,
          "severity": "HARM_SEVERITY_NEGLIGIBLE",
          "severityScore": 0.13128456
        }
      ]
    }
  ],
  "usageMetadata": {
    "promptTokenCount": 18871,
    "candidatesTokenCount": 2921,
    "totalTokenCount": 21792
  }
}
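
A transcript in the format of the sample response above (a timecode, a speaker label, then the caption) can be turned into structured segments for further processing. The following sketch is a hypothetical post-processing step, not part of the API. It assumes lines look like `0:18 Speaker B: ...`; the model's output format can vary with the prompt, so the pattern may need adjusting.

```python
import re

# Matches "M:SS Speaker X: text" or "H:MM:SS Speaker X: text".
LINE_RE = re.compile(r"^\s*(?:(\d+):)?(\d+):(\d{2})\s+(Speaker \w+):\s*(.*)$")


def parse_transcript(text: str) -> list:
    """Split a transcript into dicts with start (seconds), speaker, and text."""
    segments = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            hours, minutes, seconds, speaker, caption = match.groups()
            start = int(hours or 0) * 3600 + int(minutes) * 60 + int(seconds)
            segments.append({"start": start, "speaker": speaker, "text": caption.strip()})
        elif segments and line.strip():
            # A line without a timecode continues the previous caption.
            segments[-1]["text"] += " " + line.strip()
    return segments


if __name__ == "__main__":
    sample = (
        "0:00 Speaker A: Your devices are getting better over time, and so we think\n"
        "  about it across the entire portfolio.\n"
        "0:18 Speaker B: Welcome to the Made By Google Podcast."
    )
    for segment in parse_transcript(sample):
        print(segment["start"], segment["speaker"], segment["text"])
```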

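Note the following in the URL for this sample:

Use the generateContent method to request that the response is returned after it's fully generated. To reduce the perception of latency to a human audience, stream the response as it's being generated by using the streamGenerateContent method.
The multimodal model ID is located at the end of the URL before the method (for example, gemini-2.0-flash). This sample might support other models as well.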

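Set optional model parameters

Each model has a set of optional parameters that you can set. For more information, see Content generation parameters.

Limitations

While Gemini multimodal models are powerful in many multimodal use cases, it's important to understand the limitations of the models:

Non-speech sound recognition: The models that support audio might make mistakes when recognizing sound that's not speech.
Audio-only timestamps: To accurately generate timestamps for audio-only files, you must configure the audio_timestamp parameter in generation_config.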
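
Because timestamps for audio-only files depend on this setting, it can help to verify a REST request body before sending it. The following sketch checks for the camelCase REST form of the parameter, `generationConfig.audioTimestamp`; `has_audio_timestamp` is a hypothetical helper, and a misspelled config key simply fails the check.

```python
def has_audio_timestamp(request_body: dict) -> bool:
    """Return True if the request body enables audioTimestamp in generationConfig."""
    config = request_body.get("generationConfig")
    if not isinstance(config, dict):
        # The key is missing, misspelled, or not an object.
        return False
    return config.get("audioTimestamp") is True


if __name__ == "__main__":
    body = {
        "contents": {"role": "USER", "parts": [{"text": "Transcribe the interview."}]},
        "generationConfig": {"audioTimestamp": True},
    }
    print(has_audio_timestamp(body))  # prints True
```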

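What's next

Start building with Gemini multimodal models. New customers get $300 in free Google Cloud credits to explore what they can do with Gemini.
Learn how to send chat prompt requests.
Learn about responsible AI best practices and Vertex AI's safety filters.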
