When you cache requests using the Vertex AI API, input tokens are billed at the same 75% discount relative to standard input tokens, which guarantees cost savings. You are also charged a storage fee based on how long the data is stored.
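For illustration, here is a minimal sketch of creating and referencing an explicit cache with the Google Gen AI SDK for Python (`google-genai`) on Vertex AI. The project ID, location, Cloud Storage URI, and model name are placeholders, and exact field names can vary between SDK versions.

```python
from google import genai
from google.genai import types

# Placeholder project and location; replace with your own values.
client = genai.Client(vertexai=True, project="my-project", location="us-central1")

# Cache a large document once. The cache is stored (and billed for storage)
# until its TTL expires, and the cached content must meet the minimum
# token count for caching.
cache = client.caches.create(
    model="gemini-2.5-flash",
    config=types.CreateCachedContentConfig(
        display_name="contract-cache",
        contents=[
            types.Content(
                role="user",
                parts=[
                    types.Part.from_uri(
                        file_uri="gs://my-bucket/large-contract.pdf",  # placeholder URI
                        mime_type="application/pdf",
                    )
                ],
            )
        ],
        ttl="3600s",  # keep the cache for one hour
    ),
)

# Input tokens covered by the cache are billed at the discounted cached rate.
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="Summarize the termination clauses.",
    config=types.GenerateContentConfig(cached_content=cache.name),
)
print(response.text)
```

The TTL determines how long the cache, and its storage charges, persist; the same API lets you update the expiration time or delete the cache when it is no longer needed.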
When to use context caching
Context caching is particularly well suited to scenarios where a substantial amount of initial context is referenced repeatedly by subsequent requests.
Cached context items, such as a large amount of text, an audio file, or a video file, can be used in prompt requests to the Gemini API to generate output. Requests that use the same cache also include text that is unique to each prompt. For example, each prompt request that makes up a chat conversation might include the same context cache referencing a video, along with the unique text that makes up each turn in the chat.
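As a rough sketch of that pattern, the snippet below reuses one cache across several turn-specific questions. It assumes the cache was created earlier with the Google Gen AI SDK; the cache resource name, project, and model are placeholder values.

```python
from google import genai
from google.genai import types

client = genai.Client(vertexai=True, project="my-project", location="us-central1")

# Resource name of a previously created cache that holds the video
# (placeholder value; use the name returned when the cache was created).
cache_name = "projects/my-project/locations/us-central1/cachedContents/123456"

# Each request reuses the same cached video context and adds only the
# short, turn-specific question text.
for question in [
    "What happens in the first five minutes of the video?",
    "List the speakers that appear after the intro.",
]:
    response = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=question,
        config=types.GenerateContentConfig(cached_content=cache_name),
    )
    print(response.text)
```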
Consider using context caching for use cases such as the following:
Chatbots with extensive system instructions
Repeated analysis of lengthy video files
Recurring queries against large document sets
Frequent code repository analysis or bug fixing
Cost efficiency through caching
Context caching is a paid feature designed to reduce overall operational costs.
Billing is based on the following factors:
Cached token count: the number of cached input tokens, which are billed at a discounted rate when they are included in subsequent prompts.
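The number of tokens actually served from the cache is reported in the `cachedContentTokenCount` field of the response's usage metadata. As a back-of-the-envelope illustration of the 75% discount (with placeholder prices, not real rates), a prompt that reuses a large cached context is billed roughly as follows:

```python
# Illustrative arithmetic only; prices are placeholders, not real rates.
cached_tokens = 50_000   # input tokens served from the context cache
unique_tokens = 500      # prompt text unique to this request

standard_price_per_m = 0.30        # placeholder $ per 1M standard input tokens
cached_price_per_m = 0.30 * 0.25   # cached tokens billed at a 75% discount

without_cache = (cached_tokens + unique_tokens) / 1e6 * standard_price_per_m
with_cache = (unique_tokens / 1e6 * standard_price_per_m
              + cached_tokens / 1e6 * cached_price_per_m)

print(f"Input cost without caching: ${without_cache:.6f}")
print(f"Input cost with caching:    ${with_cache:.6f}")  # excludes cache storage charges
```

Note that this estimate covers input tokens only; cache storage charges, which depend on how long the cache is kept, are billed separately.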
[[["이해하기 쉬움","easyToUnderstand","thumb-up"],["문제가 해결됨","solvedMyProblem","thumb-up"],["기타","otherUp","thumb-up"]],[["이해하기 어려움","hardToUnderstand","thumb-down"],["잘못된 정보 또는 샘플 코드","incorrectInformationOrSampleCode","thumb-down"],["필요한 정보/샘플이 없음","missingTheInformationSamplesINeed","thumb-down"],["번역 문제","translationIssue","thumb-down"],["기타","otherDown","thumb-down"]],["최종 업데이트: 2025-07-09(UTC)"],[],[],null,["# Context caching overview\n\n| To see an example of context caching,\n| run the \"Intro to context caching\" notebook in one of the following\n| environments:\n|\n| [Open in Colab](https://colab.research.google.com/github/GoogleCloudPlatform/generative-ai/blob/main/gemini/context-caching/intro_context_caching.ipynb)\n|\n|\n| \\|\n|\n| [Open in Colab Enterprise](https://console.cloud.google.com/vertex-ai/colab/import/https%3A%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fgenerative-ai%2Fmain%2Fgemini%2Fcontext-caching%2Fintro_context_caching.ipynb)\n|\n|\n| \\|\n|\n| [Open\n| in Vertex AI Workbench](https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https%3A%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fgenerative-ai%2Fmain%2Fgemini%2Fcontext-caching%2Fintro_context_caching.ipynb)\n|\n|\n| \\|\n|\n| [View on GitHub](https://github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/context-caching/intro_context_caching.ipynb)\n\n\u003cbr /\u003e\n\nContext caching helps reduce the cost and latency of requests to\nGemini that contain repeated content. Vertex AI offers two\ntypes of caching:\n\n- **Implicit caching:** Automatic caching enabled by default that provides cost savings when cache hits occur.\n- **Explicit caching:** Manual caching enabled using the Vertex AI API, where you explicitly declare the content you want to cache and whether or not your prompts should refer to the cache content.\n\nFor both implicit and explicit caching, the [`cachedContentTokenCount`](/vertex-ai/docs/reference/rest/v1/GenerateContentResponse#UsageMetadata)\nfield in your response's metadata indicates the number of tokens in the cached\npart of your input. Caching requests must contain a minimum of 2,048 tokens.\n\nBoth implicit and explicit caching are supported when using the following\nmodels:\n\n- [Gemini 2.5 Pro](/vertex-ai/generative-ai/docs/models/gemini/2-5-pro)\n- [Gemini 2.5 Flash](/vertex-ai/generative-ai/docs/models/gemini/2-5-flash)\n- [Gemini 2.5 Flash-Lite](/vertex-ai/generative-ai/docs/models/gemini/2-5-flash-lite)\n- [Gemini 2.0 Flash](/vertex-ai/generative-ai/docs/models/gemini/2-0-flash)\n- [Gemini 2.0 Flash-Lite](/vertex-ai/generative-ai/docs/models/gemini/2-0-flash-lite)\n\nFor both implicit and explicit caching, there is no additional charge to write\nto cache other than the standard input token costs. For explicit caching, there\nare storage costs based on how long caches are stored. There are no storage\ncosts for implicit caching. For more information, see [Vertex AI pricing](/vertex-ai/generative-ai/pricing#context-caching).\n\nImplicit caching\n----------------\n\nAll Google Cloud projects have implicit caching enabled by default. Implicit\ncaching provides a 75% discount on cached tokens compared to standard\ninput tokens.\n\nWhen enabled, implicit cache hit cost savings are automatically passed on to\nyou. 
To increase the chances of an implicit cache hit:\n\n- Place large and common contents at the beginning of your prompt.\n- Send requests with a similar prefix in a short amount of time.\n\nExplicit caching\n----------------\n\nExplicit caching offers more control and ensures a 75% discount when explicit\ncaches are referenced.\n\nUsing the Vertex AI API, you can:\n\n- [Create context caches](/vertex-ai/generative-ai/docs/context-cache/context-cache-create) and control them more effectively.\n- [Use a context cache](/vertex-ai/generative-ai/docs/context-cache/context-cache-use) by referencing its contents in a prompt request with its resource name.\n- [Update a context cache's expiration time (Time to Live, or TTL)](/vertex-ai/generative-ai/docs/context-cache/context-cache-update) past the default 60 minutes.\n- [Delete a context cache](/vertex-ai/generative-ai/docs/context-cache/context-cache-delete) when no longer needed.\n\nYou can also use the Vertex AI API to\n[retrieve information about a context cache](/vertex-ai/generative-ai/docs/context-cache/context-cache-getinfo).\n\nExplicit caches interact with implicit caching, potentially leading to\nadditional caching beyond the specified contents when [creating a cache](/vertex-ai/generative-ai/docs/context-cache/context-cache-create). To\nprevent cache data retention, disable implicit caching and avoid creating\nexplicit caches. For more information, see [Enable and disable caching](/vertex-ai/generative-ai/docs/data-governance#enabling-disabling-caching).\n\nWhen to use context caching\n---------------------------\n\nContext caching is particularly well suited to scenarios where a substantial\ninitial context is referenced repeatedly by subsequent requests.\n\nCached context items, such as a large amount of text, an audio file, or a video\nfile, can be used in prompt requests to the Gemini API to generate output.\nRequests that use the same cache in the prompt also include text unique to each\nprompt. For example, each prompt request that composes a chat conversation might\ninclude the same context cache that references a video along with unique text\nthat comprises each turn in the chat.\n\nConsider using context caching for use cases such as:\n\n- Chatbots with extensive system instructions\n- Repetitive analysis of lengthy video files\n- Recurring queries against large document sets\n- Frequent code repository analysis or bug fixing\n\nContext caching support for Provisioned Throughput is in\n[Preview](https://cloud.google.com/products#product-launch-stages) for implicit caching. Explicit caching is not supported\nfor Provisioned Throughput. Refer to the [Provisioned Throughput\nguide](/vertex-ai/generative-ai/docs/provisioned-throughput/measure-provisioned-throughput#context_caching) for more details.\n\nAvailability\n------------\n\nContext caching is available in regions where Generative AI on Vertex AI is\navailable. For more information, see [Generative AI on Vertex AI\nlocations](/vertex-ai/generative-ai/docs/learn/locations).\n\nVPC Service Controls support\n----------------------------\n\nContext caching supports VPC Service Controls, meaning your cache cannot be\nexfiltrated beyond your service perimeter. 
If you use Cloud Storage to build\nyour cache, include your bucket in your service perimeter as well to protect\nyour cache content.\n\nFor more information, see [VPC Service Controls with Vertex AI](/vertex-ai/docs/general/vpc-service-controls)\nin the Vertex AI documentation.\n\nWhat's next\n-----------\n\n- Learn about [the Gemini API](/vertex-ai/generative-ai/docs/overview).\n- Learn how to [use multimodal prompts](/vertex-ai/generative-ai/docs/multimodal/send-multimodal-prompts)."]]