Infra & cost

KV Cache

An inference-time cache that stores the attention keys and values of already-processed tokens so a model doesn't recompute them for every new token.

01 ——

In plain English

The KV cache (key-value cache) is the inference optimisation that makes long conversations affordable. As a transformer generates each new token, it attends over every prior token; without a cache, that means recomputing the keys and values for the whole sequence at every step. The KV cache stores those keys and values, so each step only computes them for the newest token and reuses the rest.
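As a rough illustration of the idea, here is a toy single-head attention step in pure Python. It is a sketch under simplifying assumptions (one head, one layer, random weights, no batching), and the names `KVCache`, `output_without_cache` and `demo` are made up for this example rather than taken from any library. It checks that attending over cached keys and values gives the same output as recomputing them from scratch.

```python
import math
import random

def project(x, w):
    """Vector-matrix product: x has len(w) entries, w is a list of rows."""
    return [sum(x[i] * w[i][j] for i in range(len(w))) for j in range(len(w[0]))]

def attend(q, keys, values):
    """Scaled dot-product attention of one query over all keys/values."""
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))]

def output_without_cache(xs, wq, wk, wv):
    """Recompute keys and values for every token, then attend from the last one."""
    keys = [project(x, wk) for x in xs]
    values = [project(x, wv) for x in xs]
    return attend(project(xs[-1], wq), keys, values)

class KVCache:
    """Holds the keys and values of all tokens seen so far (hypothetical helper)."""
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, x, wq, wk, wv):
        """Project only the new token, append to the cache, attend over the cache."""
        self.keys.append(project(x, wk))
        self.values.append(project(x, wv))
        return attend(project(x, wq), self.keys, self.values)

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

def demo(num_tokens=6, dim=4, seed=0):
    """Return the largest difference between cached and uncached outputs."""
    rng = random.Random(seed)
    wq, wk, wv = (random_matrix(dim, dim, rng) for _ in range(3))
    xs = random_matrix(num_tokens, dim, rng)  # toy token embeddings
    cache = KVCache()
    worst = 0.0
    for t in range(num_tokens):
        cached = cache.step(xs[t], wq, wk, wv)
        full = output_without_cache(xs[: t + 1], wq, wk, wv)
        worst = max(worst, max(abs(a - b) for a, b in zip(cached, full)))
    return worst
```

Calling `demo()` should return a value at or very near zero: the cached path does two projections per step for keys and values, while the uncached path redoes them for every token seen so far.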

Why it matters:

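Without a KV cache, generating a 1,000-token response re-runs the model over the whole sequence at every step, so the total work grows as O(n²) token passes
With it, each step processes only the newest token, so that work drops to O(n); the new token still attends over every cached entry, so per-step attention cost stays linear in context length
It's one of the main reasons long-context models are usable in practice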
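To make the scaling concrete, the sketch below counts how many token positions the model has to push through its layers in each regime. It is a simplified counting model, not a benchmark: the function names are invented for this example, and it ignores the attention reads over cached entries, which still grow with context length.

```python
def token_passes_without_cache(prompt_len, new_tokens):
    """Each step re-runs the model over the whole sequence so far."""
    return sum(prompt_len + i for i in range(new_tokens))

def token_passes_with_cache(prompt_len, new_tokens):
    """Prefill the prompt once, then process one new token per later step."""
    if new_tokens == 0:
        return 0
    return prompt_len + (new_tokens - 1)

# A 100-token prompt followed by a 1,000-token response.
print(token_passes_without_cache(100, 1000))  # 599500
print(token_passes_with_cache(100, 1000))     # 1099
```

The first count grows quadratically with response length, the second linearly.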

Practical implications:

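Memory: the KV cache grows linearly with context length and, depending on model size and numeric precision, can reach gigabytes or more for 200K-token chats
Sharing: providers like Anthropic and OpenAI can reuse cached state across requests that start with the same prefix (this is what prompt caching builds on)
Eviction: long-running sessions evict older entries; some products silently re-prefill, meaning they reprocess the context from scratch
Reset cost: the cache is matched by prefix, so switching the system prompt at the start invalidates everything after it and forces a rebuild (latency spike)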
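For the memory point, a back-of-the-envelope estimate is two tensors (keys and values) per layer, each holding one vector per KV head per token. The sketch below applies that formula to a hypothetical configuration (32 layers, 8 KV heads, head dimension 128, 16-bit values); these numbers are assumptions chosen for illustration and do not describe any particular model.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """Approximate KV cache size for one sequence.

    The leading 2 accounts for storing both keys and values in every layer.
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens

# Hypothetical model configuration, 200K-token context, fp16/bf16 values.
size = kv_cache_bytes(tokens=200_000, layers=32, kv_heads=8, head_dim=128)
print(f"{size / 2**30:.1f} GiB")  # 24.4 GiB
```

Under these assumptions each token costs 128 KiB of cache, which is why context length dominates serving memory for long chats.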

Why you'd think about it: If you're building agentic workflows, KV cache hits drive both latency and cost. Structuring prompts so the stable parts come first and the changing parts come last keeps the cache warm, and that is a real optimisation.
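One way to reason about keeping the cache warm is to measure how long a prefix two consecutive requests share. The sketch below uses whitespace-split words as a stand-in for tokens and invented helper names; real prompt caching has provider-specific rules (minimum lengths, expiry, block granularity) that are not modelled here.

```python
def shared_prefix_len(a, b):
    """Number of leading items that two token sequences have in common."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count

# Stand-in for a tokenised system prompt (11 word-level "tokens").
SYSTEM = "You are a helpful agent . Use the tools provided .".split()

def volatile_first(timestamp, question):
    """Changing content at the front: the prefix differs on every request."""
    return [timestamp] + SYSTEM + question.split()

def stable_first(timestamp, question):
    """Stable content at the front: the system prompt is a reusable prefix."""
    return SYSTEM + [timestamp] + question.split()

bad = shared_prefix_len(volatile_first("09:00", "list files"),
                        volatile_first("09:01", "read config"))
good = shared_prefix_len(stable_first("09:00", "list files"),
                         stable_first("09:01", "read config"))
print(bad, good, len(SYSTEM))  # 0 11 11
```

Putting the timestamp first leaves nothing to reuse, while putting it after the system prompt lets the whole system prompt count as a shared prefix.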

02 ——