Context caching overview

Context caching reduces cost and latency by letting you cache reusable parts of your prompts. This page covers the following topics:

  • Caching methods
  • When to use context caching
  • Cost-efficiency through caching
  • Provisioned Throughput support
  • Fine-tuned model support
  • Supported models
  • Availability
  • VPC Service Controls support

When you send requests to Gemini that contain repeated content, context caching can reduce the cost and latency of those requests. By default, Google automatically caches inputs for all Gemini models to reduce latency for subsequent prompts. For more fine-grained control, you can use the Vertex AI API to create and manage context caches.

Caching methods

The available caching methods compare as follows:

Default caching
  Description: Automatic caching managed by Google to reduce latency for all Gemini models.
  Control: Limited. You can enable or disable it globally. It has a default expiration time of 60 minutes.
  Use case: General performance improvement for repeated prompts without requiring manual setup.

API-managed caching
  Description: Caches that you explicitly create and manage through the Vertex AI API.
  Control: Full control to create, use, update the expiration time, and delete specific caches.
  Use case: Applications with large, known, and repeatedly used contexts (for example, large documents or videos) that benefit from fine-grained control over the cache lifecycle.

When you use the Vertex AI API, you can manage caches in the following ways:

  • Create a context cache.
  • Use a context cache when generating content.
  • Update a context cache's expiration time.
  • Delete a context cache.

You can also use the Vertex AI API to get information about a context cache.
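For illustration, the following sketch walks through that lifecycle with the Google Gen AI SDK for Python. The project, location, model ID, cached content, and TTL values are placeholder assumptions; substitute your own, and note that the cached content must meet the model's minimum token count to be accepted.

```python
from google import genai
from google.genai import types

# Placeholder project, location, and model values; substitute your own.
client = genai.Client(vertexai=True, project="your-project-id", location="us-central1")

# Create a context cache from large, reusable content.
cache = client.caches.create(
    model="gemini-2.0-flash-001",
    config=types.CreateCachedContentConfig(
        display_name="example-contract-cache",
        system_instruction="You are an analyst for the attached contract.",
        contents=["<large reusable document text goes here>"],
        ttl="3600s",  # keep the cache for one hour
    ),
)

# Get information about the context cache.
print(client.caches.get(name=cache.name))

# Update the cache's expiration time by extending its TTL.
client.caches.update(
    name=cache.name,
    config=types.UpdateCachedContentConfig(ttl="7200s"),
)

# Delete the cache when it is no longer needed.
client.caches.delete(name=cache.name)
```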

Caching requests made through the Vertex AI API charge cached input tokens at the same 75% discount relative to standard input tokens, and they provide assured cost savings. There is also a storage charge based on how long the data is stored.

When to use context caching

Context caching is most effective in scenarios where a large initial context is referenced repeatedly by subsequent requests.

You can use cached context items, such as a large document or a video file, in prompt requests to the Gemini API. Each request can combine the same cached context with unique text. For example, in a chat conversation about a video, each prompt can reference the same cached video context along with the new text for each turn in the chat.
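As a sketch of this pattern, the following example uses the Google Gen AI SDK for Python to send several chat-style prompts that all reference the same cached context, so only the new text for each turn is sent as fresh input. The cache resource name, model ID, and questions are placeholder assumptions.

```python
from google import genai
from google.genai import types

client = genai.Client(vertexai=True, project="your-project-id", location="us-central1")

# Placeholder resource name of a previously created context cache
# (for example, one holding a large video or document).
cache_name = "projects/your-project-id/locations/us-central1/cachedContents/your-cache-id"

questions = [
    "Summarize the first ten minutes of the video.",
    "Which speakers appear most often?",
    "List the action items that are mentioned.",
]

for question in questions:
    response = client.models.generate_content(
        # The model must match the one the cache was created with.
        model="gemini-2.0-flash-001",
        contents=question,  # only the new text for this turn
        config=types.GenerateContentConfig(cached_content=cache_name),
    )
    print(response.text)
```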

Consider using context caching for the following use cases:

  • Chatbots with extensive system instructions
  • Repetitive analysis of lengthy video files
  • Recurring queries against large document sets
  • Frequent code repository analysis or bug fixing

Because LLM responses are nondeterministic, using the same context cache and prompt doesn't guarantee identical model responses. A context cache stores parts of the input prompt, not the model's output.

Cost-efficiency through caching

Context caching is a paid feature designed to reduce your overall operational costs. You are billed for context caching based on the following factors:

  • Cache token count: The number of input tokens cached, billed at a reduced rate when included in subsequent prompts.
  • Storage duration: The amount of time cached tokens are stored, billed hourly. The cached tokens are deleted when a context cache expires.
  • Other factors: Other charges apply, such as for non-cached input tokens and output tokens.

You can find the number of tokens in the cached part of your input in the cachedContentTokenCount field of the response metadata.
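As a minimal sketch, again using the Google Gen AI SDK for Python with placeholder identifiers, you can read the cached token count from the response's usage metadata:

```python
from google import genai
from google.genai import types

client = genai.Client(vertexai=True, project="your-project-id", location="us-central1")

response = client.models.generate_content(
    model="gemini-2.0-flash-001",
    contents="Summarize the cached document.",
    config=types.GenerateContentConfig(
        cached_content="projects/your-project-id/locations/us-central1/cachedContents/your-cache-id",
    ),
)

usage = response.usage_metadata
# cached_content_token_count corresponds to cachedContentTokenCount in the REST response.
print("Cached input tokens:", usage.cached_content_token_count)
print("Total prompt tokens:", usage.prompt_token_count)
```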

You can view cache hit token information in the response's metadata field. To disable caching, refer to Generative AI and data governance.

For pricing details, see Gemini and context caching on the Gemini pricing page.

Provisioned Throughput support

For Provisioned Throughput, default caching is supported in Preview. Context caching through the Vertex AI API is not supported for Provisioned Throughput. For more details, see the Provisioned Throughput guide.

Fine-tuned model support

Context caching is supported by both base and fine-tuned Gemini models. For more information, see Context cache for fine-tuned Gemini models.

Supported models

For the Gemini models that support context caching, see Available Gemini stable model versions. Context caching supports all MIME types for supported models.

Availability

Context caching is available in regions where Generative AI on Vertex AI is available. For more information, see Generative AI on Vertex AI locations.

VPC Service Controls support

Context caching supports VPC Service Controls, which helps prevent your cache from being moved outside of your service perimeter. To protect your cache content when you use Cloud Storage to build your cache, include your bucket in your service perimeter.

For more information, see VPC Service Controls with Vertex AI.
