Infra & cost

Prompt Caching

A feature that stores parts of a prompt the model has already processed, making repeat or follow-up requests much faster and cheaper.

01 ——

In plain English

Prompt caching is an inference optimisation where the AI provider stores the model's intermediate computation for the long, repeated parts of your prompt (system instructions, large documents, conversation history). On subsequent requests with the same prefix, the model skips re-processing it.

Why it matters:

  • Lower cost — cached tokens are typically 50–90% cheaper
  • Faster responses — time to first token can drop dramatically
  • Practical for RAG and agents — repeated context is the norm

When to use it:

  • Long system prompts or instructions
  • Document Q&A where many users ask different questions about the same doc
  • Agent loops that re-read the same tool definitions or memory
  • Multi-turn conversations

Supported by: Anthropic, OpenAI, Google, and most major providers, with slightly different APIs and pricing for cached vs fresh tokens.

02 ——

Related terms

Back to glossaryLast reviewed May 2026
Vol. 4 · Issue 19 · Last reviewed 2026-05-30

Sign up for our newsletter

Receive weekly updates so you can stay up-to-date with the world of AI