KV Cache
An inference-time cache that stores the attention keys and values already computed for earlier tokens, so a model doesn't recompute them on every new token.
In plain English
The KV cache (key-value cache) is the inference optimisation that makes long conversations affordable. As a transformer generates each new token, it attends over every prior token, and without a cache it would have to recompute the keys and values for all of them at every step. The KV cache stores those keys and values, so each new token only adds a small incremental cost: one new cache entry plus one attention pass over the stored ones.
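To make the idea concrete, here is a minimal sketch of single-head attention decoding in plain Python, with and without a cache. It is a toy under stated assumptions: the dimension `D`, the random weight matrices, and the names `KVCache` and `step_without_cache` are all made up for illustration, and real models add multiple heads, multiple layers, and batching.

```python
import math
import random

random.seed(0)
D = 4  # toy head dimension (assumption for illustration)

def rand_matrix():
    return [[random.gauss(0, 1) for _ in range(D)] for _ in range(D)]

# Hypothetical projection weights for queries, keys, and values.
W_Q, W_K, W_V = rand_matrix(), rand_matrix(), rand_matrix()

def project(x, W):
    """Row vector times matrix: x @ W."""
    return [sum(x[i] * W[i][j] for i in range(D)) for j in range(D)]

def attend(q, keys, values):
    """Softmax attention of one query over all stored keys and values."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(D) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(D)]

def step_without_cache(xs):
    """Re-project keys and values for every token seen so far."""
    keys = [project(x, W_K) for x in xs]
    values = [project(x, W_V) for x in xs]
    return attend(project(xs[-1], W_Q), keys, values)

class KVCache:
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, x):
        """Project only the new token, append it, then attend over the cache."""
        self.keys.append(project(x, W_K))
        self.values.append(project(x, W_V))
        return attend(project(x, W_Q), self.keys, self.values)

tokens = [[random.gauss(0, 1) for _ in range(D)] for _ in range(5)]
cache = KVCache()
cached = [cache.step(x) for x in tokens]
uncached = [step_without_cache(tokens[: t + 1]) for t in range(len(tokens))]

# Both paths should produce the same attention output at every step.
outputs_match = all(
    math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-12)
    for row_a, row_b in zip(cached, uncached)
    for a, b in zip(row_a, row_b)
)
```

The cached path does one key projection and one value projection per step, while the uncached path redoes them for the whole sequence; the outputs are the same either way.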
Why it matters:
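- Without a KV cache, each new token re-runs the model over the whole sequence so far, so the attention work per generated token grows as O(n²) in context length n
- With it, each new token attends once over the cached entries, so the attention work per generated token is O(n), linear in context length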
- It's one of the main reasons long-context models are usable
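One rough way to see the gap is to count how many times a token's keys and values get computed while generating a response. The sketch below is a simplified counting model, not a benchmark: `kv_projections` is a hypothetical helper, it ignores the prompt, and it counts only key/value projection work per layer (the attention pass over stored entries still grows with context either way).

```python
def kv_projections(n_tokens: int, use_cache: bool) -> int:
    """Count per-token key/value projections needed to generate n_tokens."""
    if use_cache:
        # Each token is projected once and then read from the cache.
        return n_tokens
    # Without a cache, step t re-projects all t tokens seen so far.
    return sum(range(1, n_tokens + 1))

without_cache = kv_projections(1000, use_cache=False)  # 500500
with_cache = kv_projections(1000, use_cache=True)      # 1000
```

For a 1,000-token response, that is roughly a 500x difference in projection work under this counting model.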
Practical implications:
- Memory: the KV cache grows linearly with context length and can reach gigabytes for 200K-token chats
- Sharing: providers like Anthropic and OpenAI can reuse cached state across requests that share the same prefix (this is what prompt caching uses)
- Eviction: long-running sessions evict older entries; some products silently re-prefill (recompute the context from scratch) when that happens
- Reset cost: switching the system prompt changes the start of the prefix, so the cache is invalidated and has to be rebuilt (latency spike)
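The memory point can be checked with back-of-the-envelope arithmetic: one key and one value vector per token, per layer, per KV head. The sketch below assumes a hypothetical model configuration (32 layers, 8 KV heads, head dimension 128, 2 bytes per value); these numbers are illustrative and do not describe any particular provider's model.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Estimate KV cache size for one sequence.

    The factor 2 covers keys and values; bytes_per_value=2 assumes
    16-bit storage.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

# Hypothetical configuration, chosen only for illustration.
size_bytes = kv_cache_bytes(seq_len=200_000, n_layers=32, n_kv_heads=8, head_dim=128)
size_gib = size_bytes / 2**30  # about 24.4 GiB for this made-up config
```

Under these assumptions each token costs 128 KiB of cache, which is how a 200K-token chat ends up in the tens of gigabytes.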
Why you'd think about it: If you're building agentic workflows, KV cache hits drive both latency and cost. Structuring prompts so the stable parts (system prompt, tool definitions) come first and stay unchanged keeps the cache warm, and is a real optimisation.
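A simplified way to reason about cache hits is to ask how many leading tokens a new request shares with one that was already processed. The sketch below is an illustration only: `shared_prefix_len` is a hypothetical helper, the word-level "tokens" stand in for real token IDs, and actual providers decide what is reusable in their own ways (for example, at block granularity or with explicit cache markers).

```python
def shared_prefix_len(cached_tokens, new_tokens):
    """Number of leading tokens two requests have in common."""
    n = 0
    for a, b in zip(cached_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n

system = ["SYS", "You", "are", "helpful"]
turn1 = system + ["USER", "hi"]
# Appending to the conversation keeps the whole earlier request as a prefix.
turn2 = turn1 + ["ASSISTANT", "hello", "USER", "bye"]
# Editing the system prompt breaks the prefix near the very start.
edited = ["SYS", "You", "are", "terse", "USER", "hi"]

reuse_appended = shared_prefix_len(turn1, turn2)  # 6: all of turn1 is reusable
reuse_edited = shared_prefix_len(turn1, edited)   # 3: almost everything is rebuilt
```

Appending new turns preserves the prefix, while an early edit throws away everything after the first difference, which is why stable content belongs at the front of the prompt.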