Boosting LLM Performance with Tiered KV Cache on Google Kubernetes Engine

Danna Wang
Software Engineer
Large Language Models (LLMs) are powerful, but their performance can be bottlenecked by the immense NVIDIA GPU memory footprint of the Key-Value (KV) Cache. This cache, crucial for speeding up LLM inference by storing Key (K) and Value (V) matrices, directly impacts context length, concurrency, and overall system throughput. Our primary goal is to maximize the KV Cache hit ratio by intelligently expanding NVIDIA GPU High Bandwidth Memory (HBM) with a tiered node-local storage solution.
Our collaboration with the LMCache team (Kuntai Du, Jiayi Yao, and Yihua Cheng from Tensormesh) has led to the development of an innovative solution on Google Kubernetes Engine (GKE).
Tiered Storage: Expanding the KV Cache Beyond HBM
LMCache extends the KV Cache from the NVIDIA GPU's fast HBM (Tier 1) to larger, more cost-effective tiers like CPU RAM and local SSDs. This dramatically increases the total cache size, leading to a higher hit ratio and improved inference performance by keeping more data locally on the accelerator node. For GKE users, this means accommodating models with massive context windows while maintaining excellent performance.
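To make the tiering concrete, here is a minimal sketch of one way the three tiers can be wired together when launching vLLM with LMCache. The environment-variable names follow LMCache's documented configuration style, and the paths and sizes are illustrative assumptions rather than our exact deployment values; consult the LMCache documentation for the release you run (we used v0.3.3).

```bash
# Tier 2: offload KV cache entries evicted from HBM into CPU RAM (size in GB).
export LMCACHE_LOCAL_CPU=True
export LMCACHE_MAX_LOCAL_CPU_SIZE=1000

# Tier 3: spill further to local SSD mounted into the pod (size in GB).
# Path and size are illustrative assumptions.
export LMCACHE_LOCAL_DISK="file:///local-ssd/lmcache/"
export LMCACHE_MAX_LOCAL_DISK_SIZE=5000

# Tier 1 (HBM) stays under vLLM's control; LMCache hooks in through vLLM's
# KV connector interface.
vllm serve meta-llama/Llama-3.3-70B-Instruct \
  --tensor-parallel-size 8 \
  --kv-transfer-config '{"kv_connector":"LMCacheConnectorV1","kv_role":"kv_both"}'
```

In a GKE deployment, these exports live in the serving container's spec, and the local SSD path is supplied by an emptyDir volume mount backed by the node's local SSDs.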
Performance Benchmarking and Results
We designed tests to measure the performance of this tiered KV Cache by configuring workloads to fill each storage layer (HBM, CPU RAM, Local SSD). We benchmarked these configurations using various context lengths (1k, 5k, 10k, 50k, and 100k tokens), representing diverse use cases such as:
- 1k - 5k tokens: High-fidelity personas and complex instructions
- 10k tokens: Average user prompts (small RAG) or web page/article content
- 50k tokens: Prompt stuffing
- 100k tokens: Content equivalent to a long book
Our primary performance indicators were Time to First Token (TTFT), input token throughput, and end-to-end latency.
Experiment Setup
We deployed a vLLM server on an A3 Mega machine, using local SSD for ephemeral storage via an emptyDir volume (a sketch of the corresponding node-pool setup follows the list below).
- Hardware: 8 × nvidia-h100-mega-80gb NVIDIA GPUs
- Model: Llama-3.3-70B-Instruct
- LMCache version: v0.3.3
- Cache Configuration:
  - HBM only
  - HBM + CPU RAM
  - HBM + CPU RAM + Local SSD
- Storage Resources: HBM: 640Gi, CPU RAM: 1Ti, Local SSD: 5Ti
- Benchmark Tool: SGLang bench_serving
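The node pool can be created so that the machine's local SSDs back emptyDir volumes. The command below is a hedged sketch with placeholder cluster and zone names and illustrative flag values; check the GKE documentation for A3 Mega node pools before running it.

```bash
# Sketch: an A3 Mega node pool whose local SSDs back ephemeral storage
# (emptyDir), which LMCache can then use as its third cache tier.
# CLUSTER_NAME and COMPUTE_ZONE are placeholders; the SSD count assumes the
# 16 local SSDs that A3 Mega machines ship with.
gcloud container node-pools create lmcache-a3-mega \
  --cluster=CLUSTER_NAME \
  --zone=COMPUTE_ZONE \
  --machine-type=a3-megagpu-8g \
  --accelerator=type=nvidia-h100-mega-80gb,count=8,gpu-driver-version=latest \
  --ephemeral-storage-local-ssd=count=16 \
  --num-nodes=1
```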
Requests: Tests were conducted with system prompt lengths of 1k, 5k, 10k, 50k, and 100k tokens. Each system prompt provided a shared context for a batch of 20 inference requests, with individual requests consisting of a unique 256-token input and generating a 512-token output.
Example Command:
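The invocation below is a hedged reconstruction of one run (a 10k-token shared system prompt, 20 requests sharing that prefix, 256-token unique inputs, and 512-token outputs), not the exact command we used. The generated-shared-prefix dataset and its gsp-* flags are assumptions based on recent SGLang bench_serving releases; verify them against the SGLang version you install.

```bash
# One benchmark run: shared 10k-token system prompt, 20 requests per group,
# 256-token unique inputs, 512-token outputs.
# Flag names are assumptions; check `python3 -m sglang.bench_serving --help`.
python3 -m sglang.bench_serving \
  --backend vllm \
  --base-url http://localhost:8000 \
  --model meta-llama/Llama-3.3-70B-Instruct \
  --dataset-name generated-shared-prefix \
  --gsp-num-groups 1 \
  --gsp-prompts-per-group 20 \
  --gsp-system-prompt-len 10000 \
  --gsp-question-len 256 \
  --gsp-output-len 512
```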
Benchmark Results
Our tests explored different total KV Cache sizes. The following results highlight the optimal storage setup for each size and the performance improvements achieved:
Test 1: Cache (1.1M - 1.3M tokens) fits entirely within HBM
Results: In this scenario, adding slower storage tiers provided no advantage, making an HBM-only configuration the optimal setup.
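As a rough, hedged sanity check on why a cache of this size fits in HBM (assuming Llama-3.3-70B's 80 layers, 8 KV heads per layer, 128-dimensional heads, and 2-byte bf16 KV entries; the usable budget also depends on vLLM's memory settings):

```bash
# Per-token KV footprint: 2 (K and V) * 80 layers * 8 KV heads * 128 dims * 2 bytes
echo $(( 2 * 80 * 8 * 128 * 2 ))   # 327680 bytes, i.e. ~320 KiB per token
# 1.1M-1.3M tokens therefore correspond to roughly 360-430 GB of KV data,
# which plausibly fits in the portion of the 640 GB of HBM left over after
# the ~140 GB of bf16 model weights and runtime overhead.
```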
Test 2: Cache (4.0M - 4.3M tokens) exceeds HBM capacity but fits within HBM + CPU RAM


Next Steps
- Get started with the same setup mentioned above on GKE.



