Context caching overview

Context caching aims to reduce the cost and latency of requests to Gemini that contain repeated content.

By default, Google automatically caches inputs for all Gemini models to reduce latency and accelerate responses for subsequent prompts.

Through the Vertex AI API, you can create context caches and exercise more control over them, such as choosing which content is cached and how long it is retained. You can also use the Vertex AI API to get information about a context cache.
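One way to create a cache and then retrieve information about it is with the Google Gen AI SDK for Python. The snippet below is a minimal sketch rather than the only supported path: the project ID, region, model name, and source file are placeholders, and the cached content must be large enough to meet the model's minimum cacheable token count.

```python
from google import genai
from google.genai import types

# Placeholder project, region, and model name: replace with your own values.
client = genai.Client(vertexai=True, project="your-project-id", location="us-central1")

# Hypothetical large document to reuse across many prompts.
long_document = open("big_reference_doc.txt").read()

# Create a context cache holding the shared context and a system instruction.
cache = client.caches.create(
    model="gemini-2.5-flash",
    config=types.CreateCachedContentConfig(
        display_name="example-cache",
        system_instruction="Answer questions using only the attached document.",
        contents=[
            types.Content(role="user", parts=[types.Part.from_text(text=long_document)])
        ],
        ttl="3600s",  # how long the cache is retained
    ),
)

# Get information about an existing context cache by its resource name.
info = client.caches.get(name=cache.name)
print(info.name, info.model, info.expire_time)
```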
Caches created using the Vertex AI API interact with default Google caching, which can result in additional caching beyond the contents specified when creating a cache. To achieve zero cache data retention, disable default Google caching and refrain from creating caches with the Vertex AI API. See enabling and disabling caching for more information.

For the Gemini 2.5 and later model families, cached input tokens are charged at a 75% discount relative to standard input tokens when a cache hit occurs. View cache hit token information in the response's metadata field. To disable default caching, refer to Generative AI and data governance.
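To check whether a request benefited from caching, you can read the cached token count from the response's usage metadata. The following sketch uses the Gen AI SDK for Python; the project, region, model, and prompt are placeholders, and the field appears as cachedContentTokenCount in the REST response.

```python
from google import genai

client = genai.Client(vertexai=True, project="your-project-id", location="us-central1")

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="Summarize the attached report in three bullet points.",
)

# Tokens served from a cache (default or explicit) show up in usage metadata.
usage = response.usage_metadata
print("prompt tokens:", usage.prompt_token_count)
print("cached tokens:", usage.cached_content_token_count)  # cachedContentTokenCount in REST
```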
When to use context caching

Context caching is particularly well suited to scenarios where a substantial initial context is referenced repeatedly by subsequent requests.

Cached context items, such as a large amount of text, an audio file, or a video file, can be used in prompt requests to the Gemini API to generate output. Requests that use the same cache in the prompt also include text unique to each prompt. For example, each prompt request that composes a chat conversation might include the same context cache that references a video along with unique text that comprises each turn in the chat.

Consider using context caching for use cases where the same large context, such as a lengthy document, a video, or a detailed set of system instructions, is referenced across many requests.
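In the chat scenario above, each turn can reference the same cache by its resource name and send only that turn's new text. A minimal sketch with the Gen AI SDK for Python follows; the cache name, project, region, and model are placeholders, and the model must match the one the cache was created for.

```python
from google import genai
from google.genai import types

client = genai.Client(vertexai=True, project="your-project-id", location="us-central1")

# Resource name returned by client.caches.create(); placeholder shown here.
cache_name = "projects/your-project-id/locations/us-central1/cachedContents/1234567890"

# Each turn reuses the cached context and adds only the turn-specific text.
for question in ["What happens in the opening scene?", "Who appears at the end?"]:
    response = client.models.generate_content(
        model="gemini-2.5-flash",  # must be the model the cache was created for
        contents=question,
        config=types.GenerateContentConfig(cached_content=cache_name),
    )
    print(response.text)
```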
Cost-efficiency through caching

Context caching is a paid feature designed to reduce overall operational costs. Billing is based on factors such as the number of tokens in the cached content and how long the cache is stored. The number of tokens in the cached part of your input can be found in the metadata field of your response, under the cachedContentTokenCount field.
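Because storage time factors into cost, it can be worth shortening or extending a cache's lifetime instead of recreating it, and deleting it once it is no longer needed. A sketch with the Gen AI SDK for Python; the project, region, and cache name are placeholders.

```python
from google import genai
from google.genai import types

client = genai.Client(vertexai=True, project="your-project-id", location="us-central1")
cache_name = "projects/your-project-id/locations/us-central1/cachedContents/1234567890"

# Extend the time to live so the cache persists for another two hours.
client.caches.update(name=cache_name, config=types.UpdateCachedContentConfig(ttl="7200s"))

# Delete the cache when it is no longer needed to stop storage charges.
client.caches.delete(name=cache_name)
```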
Context caching support for Provisioned Throughput is in Preview for default caching. Context caching using the Vertex AI API is not supported for Provisioned Throughput. Refer to the Provisioned Throughput guide for more details.

Supported models

For the list of Gemini models that support context caching, see Available Gemini stable model versions. Note that context caching supports all MIME types for supported models.

Availability

Context caching is available in regions where Generative AI on Vertex AI is available. For more information, see Generative AI on Vertex AI locations.

VPC Service Controls support

Context caching supports VPC Service Controls, meaning your cache cannot be
exfiltrated beyond your service perimeter. If you use Cloud Storage to
build your cache, include your bucket in your service perimeter as well to
protect your cache content. For more information, see VPC Service Controls with
Vertex AI in the
Vertex AI documentation.
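If the cache is built from files in a Cloud Storage bucket inside your service perimeter, the cached content references those objects by URI. A sketch with the Gen AI SDK for Python; the bucket, object, project, region, and model names are placeholders.

```python
from google import genai
from google.genai import types

client = genai.Client(vertexai=True, project="your-project-id", location="us-central1")

# The gs:// object should live in a bucket covered by your service perimeter.
cache = client.caches.create(
    model="gemini-2.5-flash",
    config=types.CreateCachedContentConfig(
        contents=[
            types.Content(
                role="user",
                parts=[
                    types.Part.from_uri(
                        file_uri="gs://your-perimeter-bucket/manual.pdf",
                        mime_type="application/pdf",
                    )
                ],
            )
        ],
        ttl="3600s",
    ),
)
print(cache.name)
```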
What's next