Best practices with large language models (LLMs)

Multimodal prompts

For information on best practices for multimodal prompts, see the following
pages based on the modality that you're working with:

Reduce latency

When you build interactive applications, response time, also known as latency,
plays a crucial role in the user experience. This section explores the concept
of latency in the context of Vertex AI LLM APIs and provides actionable
strategies to minimize it and improve the response time of your AI-powered
applications.

Understanding latency metrics for LLMs

Latency refers to the time it takes for a model to process your input prompt
and generate a corresponding output response.
When examining latency with a model, consider the following:

Time to first token (TTFT) is the time that it takes for the model to produce
the first token of the response after receiving the prompt. TTFT is
particularly relevant for applications that use streaming, where providing
immediate feedback is crucial.

Time to last token (TTLT) measures the overall time that the model takes to
process the prompt and generate the response.
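As a rough illustration, the following sketch approximates both metrics by
timing a streaming request with the Vertex AI SDK for Python. The project ID,
location, model name, and prompt are placeholder assumptions, and wall-clock
timing like this also includes network overhead, so treat the numbers as
indicative rather than exact.

```python
# Minimal sketch: approximate TTFT and TTLT for a streaming request.
# The project, location, model name, and prompt below are placeholders.
import time

import vertexai
from vertexai.generative_models import GenerativeModel

vertexai.init(project="your-project-id", location="us-central1")
model = GenerativeModel("gemini-1.5-flash")

start = time.monotonic()
ttft = None

# With stream=True, partial responses arrive as the model generates them.
for chunk in model.generate_content("Summarize what latency means for LLMs.",
                                    stream=True):
    if ttft is None:
        # Arrival of the first streamed chunk approximates time to first token.
        ttft = time.monotonic() - start

# Once the iterator is exhausted, the full response has been received.
ttlt = time.monotonic() - start
print(f"TTFT: {ttft:.2f} s, TTLT: {ttlt:.2f} s")
```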
Strategies to reduce latency

You can use several strategies with Vertex AI to minimize latency and enhance
the responsiveness of your applications:

Choose the right model for your use case

Vertex AI provides a diverse range of models with varying capabilities and
performance characteristics. Carefully evaluate your requirements for speed
and output quality to choose the model that best aligns with your use case.
For a list of available models, see Explore all models.

Optimize prompt and output length

The number of tokens in both your input prompt and expected output directly
impacts processing time. Minimize your token count to reduce latency.
- Craft clear and concise prompts that effectively convey your intent without
  unnecessary details or redundancy. Shorter prompts reduce your time to first
  token.

- Use system instructions to control the length of the response. Instruct the
  model to provide concise answers or limit the output to a specific number of
  sentences or paragraphs. This strategy can reduce your time to last token.

- Adjust the temperature. Experiment with the temperature parameter to control
  the randomness of the output. Lower temperature values can lead to shorter,
  more focused responses, while higher values can result in more diverse, but
  potentially longer, outputs. For more information, see temperature in the
  model parameters reference.

- Restrict output by setting a limit. Use the max_output_tokens parameter to
  set a maximum limit on the length of the generated response, preventing
  overly long output. However, be cautious as this might cut off responses
  mid-sentence. For a combined example of these settings, see the sketch after
  this list.
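To make these levers concrete, here is a minimal sketch using the Vertex AI
SDK for Python that combines a length-limiting system instruction with the
temperature and max_output_tokens settings. The project ID, model name,
prompt, and parameter values are illustrative assumptions, not
recommendations.

```python
# Minimal sketch: limit response length with a system instruction plus
# generation parameters. Project, model name, and values are placeholders.
import vertexai
from vertexai.generative_models import GenerationConfig, GenerativeModel

vertexai.init(project="your-project-id", location="us-central1")

model = GenerativeModel(
    "gemini-1.5-flash",
    # System instruction asking for concise answers (reduces time to last token).
    system_instruction="Answer in no more than three sentences.",
)

config = GenerationConfig(
    temperature=0.2,        # Lower values tend to give shorter, more focused output.
    max_output_tokens=256,  # Hard cap on length; may cut off a response mid-sentence.
)

response = model.generate_content(
    "Explain why prompt length affects latency.",
    generation_config=config,
)
print(response.text)
```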
Stream responses

With streaming, the model starts sending its response before it generates the
complete output. This enables real-time processing of the output, and you can
immediately update your user interface and perform other concurrent tasks.
Streaming enhances perceived responsiveness and creates a more interactive
user experience.
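As an illustration, the sketch below streams a response with the Vertex AI SDK
for Python and prints each chunk as soon as it arrives; in a real application
you would update the user interface instead. The project ID, model name, and
prompt are placeholder assumptions.

```python
# Minimal sketch: stream a response and handle partial output as it arrives.
# Project, location, model name, and prompt are placeholders.
import vertexai
from vertexai.generative_models import GenerativeModel

vertexai.init(project="your-project-id", location="us-central1")
model = GenerativeModel("gemini-1.5-flash")

# stream=True yields partial responses instead of a single final response.
for chunk in model.generate_content("Write a short note about streaming.",
                                    stream=True):
    # Each chunk carries newly generated text; render it incrementally here.
    print(chunk.text, end="", flush=True)
print()
```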
What's next