Use context caching to reduce the cost of requests that repeatedly include the same high-token-count content. Cached context items, such as a large amount of text, an audio file, or a video file, can be referenced by prompt requests to the Gemini API when generating output. Each request that references the same cache also includes text that's unique to that prompt. For example, every prompt request in a chat conversation might reference the same context cache for a video, along with the unique text that makes up each turn in the chat. The minimum size of a context cache is 32,768 tokens.
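To make the chat example concrete, here's a minimal sketch using the Vertex AI SDK for Python. The project ID, region, and cache resource name are placeholders, and the cache is assumed to already exist:

```python
import vertexai
from vertexai.preview import caching
from vertexai.preview.generative_models import GenerativeModel

# Placeholder project and region.
vertexai.init(project="your-project-id", location="us-central1")

# Look up an existing context cache by its resource name (placeholder).
cached_content = caching.CachedContent(
    cached_content_name="projects/your-project-id/locations/us-central1/cachedContents/your-cache-id"
)

# Every turn in this chat reuses the cached context; only the
# short per-turn text is sent as new input.
model = GenerativeModel.from_cached_content(cached_content=cached_content)
chat = model.start_chat()
print(chat.send_message("Summarize the first half of the video.").text)
print(chat.send_message("Now summarize the second half.").text)
```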
Supported models
The following models support context caching:
- Stable versions of Gemini 1.5 Flash
- Stable versions of Gemini 1.5 Pro
For more information, see Available Gemini stable model versions.
Context caching is available in regions where Generative AI on Vertex AI is available. For more information, see Generative AI on Vertex AI locations.
Supported MIME types
Context caching supports the following MIME types:
- application/pdf
- audio/mp3
- audio/mpeg
- audio/wav
- image/jpeg
- image/png
- text/plain
- video/avi
- video/flv
- video/mov
- video/mp4
- video/mpeg
- video/mpegps
- video/mpg
- video/wmv
When to use context caching
Context caching is particularly well suited to scenarios where a substantial initial context is referenced repeatedly by shorter requests. Consider using context caching for use cases such as:
- Chatbots with extensive system instructions
- Repetitive analysis of lengthy video files
- Recurring queries against large document sets
- Frequent code repository analysis or bug fixing
Cost-efficiency through caching
Context caching is a paid feature designed to reduce overall operational costs. Billing is based on the following factors:
- Cache token count: The number of input tokens cached, billed at a reduced rate when included in subsequent prompts.
- Storage duration: The amount of time cached tokens are stored, billed hourly. The cached tokens are deleted when a context cache expires.
- Other charges: Other charges still apply, such as for non-cached input tokens and for output tokens.
How to use a context cache
To use context caching, you first create the context cache. To reference its contents in a prompt request, use the cache's resource name. The resource name is returned in the response to the request that creates the cache.
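As a minimal sketch with the Vertex AI SDK for Python, creating a cache and reading its resource name might look like the following; the project ID, region, and Cloud Storage URI are placeholders:

```python
import vertexai
from vertexai.preview import caching
from vertexai.preview.generative_models import Part

vertexai.init(project="your-project-id", location="us-central1")

# Cache a large video file plus a system instruction. The combined
# content must meet the 32,768-token minimum.
cached_content = caching.CachedContent.create(
    model_name="gemini-1.5-pro-002",
    system_instruction="You are an expert video analyst.",
    contents=[
        Part.from_uri(
            "gs://your-bucket/your-video.mp4",  # placeholder URI
            mime_type="video/mp4",
        )
    ],
)

# The resource name is returned in the create response; use it to
# reference the cache in later prompt requests.
print(cached_content.resource_name)
```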
By default, a context cache expires 60 minutes after it's created. If needed, you can specify a different expiration time when you create the context cache, or update the expiration time of a context cache that hasn't yet expired.
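A sketch of both options with the Vertex AI SDK for Python follows; the project, region, and Cloud Storage URI are placeholders, and the `ttl` parameter and `update` call come from the preview caching API:

```python
import datetime

import vertexai
from vertexai.preview import caching
from vertexai.preview.generative_models import Part

vertexai.init(project="your-project-id", location="us-central1")

# Specify a two-hour expiration at creation time instead of the
# 60-minute default.
cached_content = caching.CachedContent.create(
    model_name="gemini-1.5-pro-002",
    contents=[
        Part.from_uri(
            "gs://your-bucket/your-video.mp4",  # placeholder URI
            mime_type="video/mp4",
        )
    ],
    ttl=datetime.timedelta(hours=2),
)

# Later, update the unexpired cache so that it expires one hour
# after the update call.
cached_content.update(ttl=datetime.timedelta(hours=1))
```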
The following topics include details and samples that help you create, use, update, get information about, and delete a context cache (a short sketch of getting information about and deleting a cache follows the list):
- Create a context cache
- Use a context cache
- Get information about a context cache
- Update the expiration time of a context cache
- Delete a context cache
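For instance, a rough sketch of the get and delete operations with the Vertex AI SDK for Python might look like this; the project, region, and resource name are placeholders:

```python
import vertexai
from vertexai.preview import caching

vertexai.init(project="your-project-id", location="us-central1")

# Look up a cache by its resource name and inspect its metadata.
cached_content = caching.CachedContent(
    cached_content_name="projects/your-project-id/locations/us-central1/cachedContents/your-cache-id"
)
print(cached_content.model_name)
print(cached_content.expire_time)

# List all context caches in the project and region.
for cache in caching.CachedContent.list():
    print(cache.resource_name)

# Delete the cache before it expires to stop storage charges.
cached_content.delete()
```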
VPC Service Controls support
Context caching supports VPC Service Controls, which means that your cache can't be exfiltrated beyond your service perimeter. If you use Cloud Storage to build your cache, also include your bucket in your service perimeter to protect your cache content.
For more information, see VPC Service Controls with Vertex AI in the Vertex AI documentation.
What's next
- Learn about the Gemini API.
- Learn how to use multimodal prompts.