Infra & cost

KV Cache

An inference-time cache that stores the attention keys and values already computed for earlier tokens, so a model doesn’t re-process them for every new token.

01 ——

In plain English

The KV cache (key-value cache) is the inference optimisation that makes long conversations affordable. Without it, a transformer generating a new token has to recompute the attention keys and values for every prior token. The KV cache stores those keys and values, so each new token only adds a small incremental cost.
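
To make the mechanism concrete, here is a minimal sketch of a single attention head in plain Python. The matrices `W_Q`, `W_K`, `W_V`, the `KVCache` class, and the 2-dimensional "tokens" are all hypothetical toy values, not taken from any real model. It assumes causal decoding, where earlier tokens never change, and checks that the cached path gives the same output as recomputing everything at every step.

```python
import math

def project(x, w):
    """Multiply vector x (length d) by matrix w (d rows, d columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]

def attend(q, keys, values):
    """Softmax-weighted average of values, weighted by scaled q.k scores."""
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))]

# Hypothetical toy projection matrices (assumption: not from a real model).
W_Q = [[1.0, 0.0], [0.0, 1.0]]
W_K = [[0.5, -0.5], [0.5, 0.5]]
W_V = [[1.0, 2.0], [0.0, 1.0]]

def step_without_cache(tokens):
    """Recompute keys and values for every token, then attend from the last."""
    keys = [project(t, W_K) for t in tokens]
    values = [project(t, W_V) for t in tokens]
    return attend(project(tokens[-1], W_Q), keys, values)

class KVCache:
    """Keeps the keys and values of all tokens seen so far."""
    def __init__(self):
        self.keys = []
        self.values = []

    def step(self, token):
        """Project only the new token, append it, attend over the cache."""
        self.keys.append(project(token, W_K))
        self.values.append(project(token, W_V))
        return attend(project(token, W_Q), self.keys, self.values)

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -1.0]]
cache = KVCache()
for n in range(1, len(tokens) + 1):
    cached = cache.step(tokens[n - 1])
    recomputed = step_without_cache(tokens[:n])
    assert all(math.isclose(a, b, abs_tol=1e-12) for a, b in zip(cached, recomputed))
print("cached and recomputed outputs match at every step")
```

The cached path performs one key projection and one value projection per step, while the uncached path repeats them for the whole sequence each time. A real model does the same thing per layer and per attention head.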

Why it matters:

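  • Without a KV cache, every step re-processes the whole sequence, so generating a 1,000-token response takes about 500,000 token passes: O(n²) in response length
  • With it, each token is processed once, so generation takes O(n) token passes, linear in context length, although each new token still attends over all cached entries
  • It's one of the main reasons long-context models are usable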
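
The scaling difference can be checked with a little arithmetic. The sketch below counts "token passes", meaning how many token positions the model has to push through its layers. The function names and the one-token prompt are illustrative assumptions. The count deliberately ignores the cost of attention itself, which still grows with the number of cached entries.

```python
def token_passes_without_cache(prompt_len, new_tokens):
    """Each step re-processes the prompt plus everything generated so far."""
    return sum(prompt_len + i for i in range(new_tokens))

def token_passes_with_cache(prompt_len, new_tokens):
    """The prompt is processed once (prefill); each later step processes
    only the single most recently generated token."""
    return prompt_len + (new_tokens - 1)

# Assumption: a one-token prompt and a 1,000-token response.
print(token_passes_without_cache(1, 1000))  # 500500
print(token_passes_with_cache(1, 1000))     # 1000
```

With a longer prompt the gap widens further, because the uncached path re-processes the prompt at every step.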

Practical implications:

  • Memory: the KV cache grows linearly with context length and can hit gigabytes for 200K-token chats
  • Sharing: providers such as Anthropic and OpenAI can reuse cached state across requests that share the same prefix (this is the basis of prompt caching)
  • Eviction: servers may evict cached entries for idle or long-running sessions; some products then silently re-run prefill over the whole context
  • Reset cost: changing the system prompt changes the start of the prefix, so the cache is invalidated and must be rebuilt (latency spike)
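
The memory point can be estimated with the standard back-of-the-envelope formula: two tensors (keys and values) per layer, each holding one vector per KV head per token. The sketch below is a rough estimate only. The model configuration is hypothetical, and it assumes 16-bit values with no cache quantisation or compression.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim,
                   bytes_per_value=2, batch_size=1):
    """Approximate KV cache size in bytes for a given context length."""
    # The leading 2 counts one key tensor plus one value tensor per layer.
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * bytes_per_value * batch_size)

# Hypothetical configuration (assumption: not any specific product's model).
config = dict(n_layers=32, n_kv_heads=8, head_dim=128)

per_token = kv_cache_bytes(1, **config)
total = kv_cache_bytes(200_000, **config)
print(f"{per_token / 1024:.0f} KiB per token")        # 128 KiB per token
print(f"{total / 1e9:.1f} GB for 200,000 tokens")     # 26.2 GB for 200,000 tokens
```

Models with more layers or more KV heads scale this up proportionally, which is why the number of KV heads matters so much for long-context serving.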

Why you'd think about it: If you're building agentic workflows, KV cache hits drive both latency and cost. Structuring prompts to keep the cache warm is a real optimisation.
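
One way to reason about keeping the cache warm is to measure how much of the previous request the next one repeats from the very start, since only an unchanged leading prefix can be reused. The sketch below uses words as stand-ins for tokens. The `SYSTEM` and `TOOLS` lists, the timestamps, and both layout functions are hypothetical. Real providers match on token IDs and may apply their own granularity and minimum-length rules, which this does not model.

```python
def shared_prefix_len(previous, current):
    """Number of leading items that two requests have in common."""
    n = 0
    for a, b in zip(previous, current):
        if a != b:
            break
        n += 1
    return n

# Hypothetical request parts; words stand in for tokens.
SYSTEM = ["You", "are", "a", "helpful", "agent", "."]
TOOLS = ["tools", ":", "search", ",", "read_file"]

def stable_first(timestamp, turns):
    """Stable content first, volatile content (the timestamp) last."""
    return SYSTEM + TOOLS + turns + [timestamp]

def volatile_first(timestamp, turns):
    """Volatile content first: every request starts differently."""
    return [timestamp] + SYSTEM + TOOLS + turns

turns_1 = ["user:", "list", "files"]
turns_2 = turns_1 + ["assistant:", "done", "user:", "open", "README"]

stable_1 = stable_first("time=10:00", turns_1)
stable_2 = stable_first("time=10:05", turns_2)
volatile_1 = volatile_first("time=10:00", turns_1)
volatile_2 = volatile_first("time=10:05", turns_2)

print(shared_prefix_len(stable_1, stable_2))      # 14
print(shared_prefix_len(volatile_1, volatile_2))  # 0
```

In the stable-first layout the system prompt, tool list, and earlier turns are all reusable. In the volatile-first layout a single changing item at the front leaves nothing to reuse.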

Vol. 4 · Issue 19 · Last reviewed 2026-05-30
