Best practices with large language models (LLMs)

This guide describes best practices for working with large language models (LLMs). It covers the following topics:

Multimodal prompts: Where to find best practices for prompts that combine text with other modalities.
Reduce latency: Strategies for making your applications more responsive.
Multimodal prompts

To learn about best practices for multimodal prompts, see the page for the modality that you're working with.
Reduce latency

When you build interactive applications, response time (latency) is a crucial part of the user experience. This section explains latency for Vertex AI LLM APIs and provides strategies to reduce it.
Understanding latency metrics for LLMs

Latency is the time it takes for a model to process your input prompt and generate a response. When you evaluate latency, consider the following metrics:

Time to first token (TTFT): The time between sending the prompt and receiving the first token of the response. TTFT is especially important for streaming applications, where early feedback shapes the user's perception of speed.
Time to last token (TTLT): The total time the model takes to process the prompt and generate the complete response.
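As a concrete illustration, the following sketch measures both metrics with a single streaming call using the Vertex AI Python SDK. The project ID, model name, and prompt are placeholders, not recommendations:

```python
import time

import vertexai
from vertexai.generative_models import GenerativeModel

# Placeholders: replace with your project and region.
vertexai.init(project="your-project-id", location="us-central1")

model = GenerativeModel("gemini-1.5-flash")

start = time.monotonic()
first_token_time = None

# Streaming exposes when the first chunk arrives (TTFT) as well as
# when the full response completes (TTLT).
for chunk in model.generate_content("Name three uses for LLMs.", stream=True):
    if first_token_time is None:
        first_token_time = time.monotonic()
end = time.monotonic()

print(f"Time to first token (TTFT): {first_token_time - start:.2f}s")
print(f"Time to last token (TTLT):  {end - start:.2f}s")
```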
Strategies to reduce latency

To reduce latency and improve the responsiveness of your applications, you can use the following strategies with Vertex AI:

Choose the right model for your use case. Vertex AI offers a range of models with different capabilities and performance characteristics. To choose the best model, evaluate your requirements for speed and output quality. For a list of available models, see Explore all models.
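Switching models is typically a one-line change. In this sketch, the model names are illustrative (check the current model list for what is available in your region), and vertexai.init is assumed to have already run:

```python
from vertexai.generative_models import GenerativeModel

# Flash-class models generally favor speed; Pro-class models favor
# output quality. Names here are illustrative examples.
fast_model = GenerativeModel("gemini-1.5-flash")
quality_model = GenerativeModel("gemini-1.5-pro")

response = fast_model.generate_content("Summarize: LLM latency matters.")
print(response.text)
```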
Optimize prompt and output length. The number of tokens in your input prompt and in the expected output directly impacts processing time. To reduce latency, minimize your token count:

temperature: To control the randomness of the output, experiment with the temperature parameter. Lower temperature values can lead to shorter, more focused responses, while higher values can result in more diverse but potentially longer outputs. For more information, see temperature in the model parameters reference.

max_output_tokens: Use the max_output_tokens parameter to set a maximum length for the generated response. Be aware that this might cut off responses mid-sentence.
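The following sketch applies both parameters through the Vertex AI Python SDK's GenerationConfig. The specific values are illustrative, not recommendations:

```python
from vertexai.generative_models import GenerationConfig, GenerativeModel

model = GenerativeModel("gemini-1.5-flash")

# A lower temperature tends toward shorter, more focused responses;
# max_output_tokens hard-caps the response length (and may truncate
# it mid-sentence).
config = GenerationConfig(temperature=0.2, max_output_tokens=256)

response = model.generate_content(
    "List three ways to reduce LLM latency.",
    generation_config=config,
)
print(response.text)
```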
Stream responses. When you use streaming, the model sends its response as it's being generated, instead of waiting for the complete output. This lets you process the output in real time, so you can immediately update your user interface and perform other concurrent tasks. Streaming improves perceived responsiveness and creates a more interactive user experience.
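With the Vertex AI Python SDK, passing stream=True yields chunks as they are generated. A minimal sketch, with error handling omitted:

```python
from vertexai.generative_models import GenerativeModel

model = GenerativeModel("gemini-1.5-flash")

# Each chunk arrives as soon as the model produces it, so the UI can
# render partial output instead of waiting for the full response.
for chunk in model.generate_content("Write a short poem.", stream=True):
    print(chunk.text, end="", flush=True)
```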
What's next