{"contents":[{"role":"user","parts":[{"fileData":{"mimeType":"audio/mpeg","fileUri":"gs://cloud-samples-data/generative-ai/audio/pixel.mp3"}},{"text":"Please summarize the conversation in one sentence."}]},{"role":"model","parts":[{"text":"The podcast episode features two product managers for Pixel devices discussing the new features coming to Pixel phones and watches."}]}]}
[[["容易理解","easyToUnderstand","thumb-up"],["確實解決了我的問題","solvedMyProblem","thumb-up"],["其他","otherUp","thumb-up"]],[["難以理解","hardToUnderstand","thumb-down"],["資訊或程式碼範例有誤","incorrectInformationOrSampleCode","thumb-down"],["缺少我需要的資訊/範例","missingTheInformationSamplesINeed","thumb-down"],["翻譯問題","translationIssue","thumb-down"],["其他","otherDown","thumb-down"]],["上次更新時間:2025-09-04 (世界標準時間)。"],[],[],null,["# Audio Tuning\n\nThis page provides prerequisites and detailed instructions for fine-tuning\nGemini on audio data using supervised learning.\n\nUse cases\n---------\n\nTuning audio models enhances their performance by tailoring them to specific\nneeds. This can involve improving speech recognition for different accents,\nfine-tuning music genre classification, optimizing sound event detection,\ncustomizing audio generation, adapting to noisy environments, improving audio\nquality, and personalizing audio experiences. Here are some common audio tuning use\ncases:\n\n- **Enhanced voice assistants**:\n\n - Voice food ordering: Develop voice-activated systems for seamless food ordering and delivery.\n- **Audio content analysis**:\n\n - Automated transcription: Generate highly accurate transcripts, even in noisy environments.\n - Audio summarization: Summarize key points from podcasts or audiobooks.\n - Music classification: Categorize music based on genre, mood, or other characteristics.\n- **Accessibility and assistive technologies**:\n\n - Real-time captioning: Provide live captions for events or video calls.\n - Voice-controlled applications: Develop applications controlled entirely by voice.\n - Language learning: Create tools that provide personalized feedback on pronunciation.\n\nLimitations\n-----------\n\n### Gemini 2.5 models\n\n### Gemini 2.0 Flash\nGemini 2.0 Flash-Lite\n\nTo learn more about audio sample requirements, see the [Audio understanding (speech only)](/vertex-ai/generative-ai/docs/multimodal/audio-understanding#audio-requirements) page.\n\nDataset format\n--------------\n\nThe `fileUri` for your dataset can be the URI for a file in a Cloud Storage\nbucket, or it can be a publicly available HTTP or HTTPS URL.\n\nTo see the generic format example, see\n[Dataset example for Gemini](/vertex-ai/generative-ai/docs/models/gemini-supervised-tuning-prepare#dataset-example).\n\nThe following is an example of an audio dataset. \n\n {\n \"contents\": [\n {\n \"role\": \"user\",\n \"parts\": [\n {\n \"fileData\": {\n \"mimeType\": \"audio/mpeg\",\n \"fileUri\": \"gs://cloud-samples-data/generative-ai/audio/pixel.mp3\"\n }\n },\n {\n \"text\": \"Please summarize the conversation in one sentence.\"\n }\n ]\n },\n {\n \"role\": \"model\",\n \"parts\": [\n {\n \"text\": \"The podcast episode features two product managers for Pixel devices discussing the new features coming to Pixel phones and watches.\"\n }\n ]\n }\n ]\n }\n\nWhat's next\n-----------\n\n- To learn more about the Gemini audio understanding model, see [Audio understanding (speech only)](/vertex-ai/generative-ai/docs/multimodal/audio-understanding).\n- To start tuning, see [Tune Gemini models by using supervised fine-tuning](/vertex-ai/generative-ai/docs/models/gemini-use-supervised-tuning).\n- To learn how supervised fine-tuning can be used in a solution that builds a generative AI knowledge base, see [Jump Start Solution: Generative AI\n knowledge base](/architecture/ai-ml/generative-ai-knowledge-base)."]]