LLM
Prefix Caching: Reusing KV Cache Across Requests
· ☕ 8 min read · ✍️ k4i
When thousands of requests share the same system prompt, recomputing its KV cache each time is pure waste. Prefix caching stores those key and value vectors once and reuses them, cutting time to first token (TTFT) by up to 97% in common deployments.
Paged Attention: Virtual Memory for the GPU
· ☕ 10 min read · ✍️ k4i
How vLLM borrows the OS paging idea to eliminate KV cache memory fragmentation, pushing GPU memory utilization from ~30% to ~96%.
Why KV Cache Works in LLM Inference
· ☕ 8 min read · ✍️ k4i
Why the key-value cache avoids redundant computation in autoregressive decoding, and the memory/compute tradeoffs it introduces.