[[["わかりやすい","easyToUnderstand","thumb-up"],["問題の解決に役立った","solvedMyProblem","thumb-up"],["その他","otherUp","thumb-up"]],[["わかりにくい","hardToUnderstand","thumb-down"],["情報またはサンプルコードが不正確","incorrectInformationOrSampleCode","thumb-down"],["必要な情報 / サンプルがない","missingTheInformationSamplesINeed","thumb-down"],["翻訳に関する問題","translationIssue","thumb-down"],["その他","otherDown","thumb-down"]],["最終更新日 2025-09-04 UTC。"],[],[],null,["# Best practices with large language models (LLMs)\n\nMultimodal prompts\n------------------\n\nFor information on best practices for multimodal prompts, see the following\npages based on the modality that you're working with:\n\n- [Image understanding](/vertex-ai/generative-ai/docs/multimodal/image-understanding)\n- [Video understanding](/vertex-ai/generative-ai/docs/multimodal/video-understanding)\n- [Audio understanding](/vertex-ai/generative-ai/docs/multimodal/audio-understanding)\n- [Document understanding](/vertex-ai/generative-ai/docs/multimodal/document-understanding)\n\nReduce latency\n--------------\n\nWhen you build interactive applications, response time, also known as latency,\nplays a crucial role in the user experience. This section explores the concept\nof latency in the context of Vertex AI LLM APIs and provides\nactionable strategies to minimize it and improve the response time of\nyour AI-powered applications.\n\n### Understanding latency metrics for LLMs\n\nLatency refers to the time it takes for a model to process your input\nprompt and generate a corresponding output response.\n\nWhen examining latency with a model, consider the following:\n\n*Time to first token (TTFT)* is the time that it takes for the model to produce\nthe first token of the response after receiving the prompt. TTFT is particularly\nrelevant for applications utilizing streaming, where providing immediate\nfeedback is crucial.\n\n*Time to last token (TTLT)* measures the overall time taken by the model to process\nthe prompt and generate the response.\n\n### Strategies to reduce latency\n\nYou can utilize several strategies with Vertex AI\nto minimize latency and enhance the responsiveness of your applications:\n\n#### Choose the right model for your use case\n\nVertex AI provides a diverse range of models with varying\ncapabilities and performance characteristics. Carefully evaluate your\nrequirements regarding speed and output quality to choose the model that best\naligns with your use case. For a list of available models, see\n[Explore all models](/vertex-ai/generative-ai/docs/model-garden/explore-models).\n\n#### Optimize prompt and output length\n\nThe number of tokens in both your input prompt and expected output directly\nimpacts processing time. Minimize your token count to reduce\nlatency.\n\n- Craft clear and concise prompts that effectively convey your intent without\n unnecessary details or redundancy. Shorter prompts reduce your time to first token.\n\n- Use *system instructions* to control the length of the response. Instruct the\n model to provide concise answers or limit the output to a specific number of\n sentences or paragraphs. This strategy can reduce your time to last token.\n\n- Adjust the `temperature`. Experiment with the `temperature` parameter to\n control the randomness of the output. Lower `temperature` values can lead to\n shorter, more focused responses, while higher values can result in more\n diverse, but potentially longer, outputs. 
#### Stream responses

With streaming, the model starts sending its response before it generates the
complete output. This enables real-time processing of the output, and you can
immediately update your user interface and perform other concurrent tasks.

Streaming enhances perceived responsiveness and creates a more interactive user
experience. For a minimal code sketch of this approach, see the example at the
end of this page.

What's next
-----------

- Learn [general prompt design strategies](/vertex-ai/generative-ai/docs/learn/prompt-design-strategies).
- See some [sample prompts](/vertex-ai/generative-ai/docs/prompt-gallery).
- Learn how to [send chat prompts](/vertex-ai/generative-ai/docs/multimodal/send-chat-prompts-gemini).
- Learn about [responsible AI best practices and Vertex AI's safety filters](/vertex-ai/generative-ai/docs/learn/responsible-ai).
- Learn how to [tune a model](/vertex-ai/generative-ai/docs/models/tune-models).
- Learn about [Provisioned Throughput](/vertex-ai/generative-ai/docs/provisioned-throughput) to assure production workloads.
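To complement the Stream responses section above, here is a minimal streaming
sketch, again assuming the Google Gen AI SDK configured for Vertex AI with
placeholder project, location, and model values. Each chunk is printed as soon
as it arrives, which is the point at which an application could update its
user interface.

```python
# Minimal streaming sketch: surface partial output as it arrives instead of
# waiting for the full response. Assumes the same google-genai client setup as
# the earlier example; project, location, and model name are placeholders.
from google import genai

client = genai.Client(vertexai=True, project="your-project-id", location="us-central1")

for chunk in client.models.generate_content_stream(
    model="gemini-2.5-flash",
    contents="Explain why streaming improves perceived latency.",
):
    # Each chunk carries a fragment of the response; showing it immediately is
    # what keeps perceived latency close to the time to first token.
    print(chunk.text or "", end="", flush=True)
print()
```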