This guide describes best practices for working with large language models (LLMs). It covers the following topics:
- Multimodal prompts: Find links to best practices for prompts that include images, video, audio, or documents.
- Reduce latency: Discover strategies to improve model response time for a better user experience.
Multimodal prompts
To learn about best practices for multimodal prompts, see the page for the modality that you're working with: images, video, audio, or documents.
Reduce latency
When you build interactive applications, response time (latency) is a crucial part of the user experience. This section explains latency for Vertex AI LLM APIs and provides strategies to reduce it.
Understanding latency metrics for LLMs
Latency is the time it takes for a model to process your input prompt and generate a response.
When you evaluate latency, consider the following metrics:
- Time to first token (TTFT): The time it takes for the model to return the first token of the response after it receives the prompt. TTFT is especially important for streaming applications where immediate feedback is crucial.
- Time to last token (TTLT): The overall time taken by the model to process the prompt and generate the complete response.
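For example, you can estimate both metrics by timing a streaming request. The following is a minimal sketch that assumes the Vertex AI SDK for Python (`vertexai.generative_models`); the project ID, location, model name, and prompt are placeholders.

```python
import time

import vertexai
from vertexai.generative_models import GenerativeModel

# Placeholder project and location; replace with your own values.
vertexai.init(project="your-project-id", location="us-central1")
model = GenerativeModel("gemini-1.5-flash")

start = time.monotonic()
first_token_time = None

# Stream the response so the first chunk arrives before generation finishes.
for chunk in model.generate_content("Summarize the benefits of streaming.", stream=True):
    if first_token_time is None:
        first_token_time = time.monotonic()

end = time.monotonic()
print(f"Time to first token (TTFT): {first_token_time - start:.2f} s")
print(f"Time to last token (TTLT):  {end - start:.2f} s")
```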
Strategies to reduce latency
To reduce latency and improve the responsiveness of your applications, you can use the following strategies with Vertex AI:
Choose the right model for your use case. Vertex AI offers a range of models with different capabilities and performance characteristics. To choose the best model for your use case, evaluate your requirements for speed and output quality. For a list of available models, see Explore all models.
Optimize prompt and output length. The number of tokens in your input prompt and in the expected output directly impacts processing time. To reduce latency, minimize your token count.
- Write clear and concise prompts that convey your intent without unnecessary details. Shorter prompts reduce the time to first token.
- To control the length of the response, use system instructions. You can instruct the model to provide concise answers or limit the output to a specific number of sentences or paragraphs. This strategy can reduce the time to last token.
- Adjust the `temperature`. To control the randomness of the output, experiment with the `temperature` parameter. Lower `temperature` values can lead to shorter, more focused responses. Higher values can result in more diverse but potentially longer outputs. For more information, see `temperature` in the model parameters reference.
- Set an output limit. To prevent overly long output, use the `max_output_tokens` parameter to set a maximum length for the generated response. Be aware that this might cut off responses mid-sentence.
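The following sketch combines these settings in a single request. It assumes the Vertex AI SDK for Python; the project ID, model name, prompt, and parameter values are illustrative placeholders, not recommendations.

```python
import vertexai
from vertexai.generative_models import GenerationConfig, GenerativeModel

# Placeholder project and location; replace with your own values.
vertexai.init(project="your-project-id", location="us-central1")

# A system instruction that asks for concise answers helps reduce time to last token.
model = GenerativeModel(
    "gemini-1.5-flash",
    system_instruction="Answer in at most two sentences.",
)

response = model.generate_content(
    "Why does prompt length affect latency?",
    generation_config=GenerationConfig(
        temperature=0.2,        # Lower values tend to produce shorter, more focused output.
        max_output_tokens=128,  # Hard cap on response length; may truncate mid-sentence.
    ),
)
print(response.text)
```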
Stream responses. When you use streaming, the model sends its response as it's being generated, instead of waiting for the complete output. This lets you process the output in real time, so you can immediately update your user interface and perform other concurrent tasks. Streaming improves perceived responsiveness and creates a more interactive user experience.
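As a sketch, again assuming the Vertex AI SDK for Python with a placeholder project and model name, streaming only requires passing `stream=True` and iterating over the returned chunks:

```python
import vertexai
from vertexai.generative_models import GenerativeModel

# Placeholder project and location; replace with your own values.
vertexai.init(project="your-project-id", location="us-central1")
model = GenerativeModel("gemini-1.5-flash")

# Each chunk arrives as soon as it is generated, so you can update the UI immediately.
for chunk in model.generate_content("Write a short product description.", stream=True):
    print(chunk.text, end="", flush=True)
print()
```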
What's next
- Learn general prompt design strategies.
- See some sample prompts.
- Learn how to send chat prompts.
- Learn about responsible AI best practices and Vertex AI's safety filters.
- Learn how to tune a model.
- Learn about Provisioned Throughput to assure production workloads.