Prompt Caching
A feature that stores parts of a prompt the model has already processed, making repeat or follow-up requests much faster and cheaper.
In plain English
Prompt caching is an inference optimisation where the AI provider stores the model's intermediate computation for the long, repeated parts of your prompt (system instructions, large documents, conversation history). On subsequent requests with the same prefix, the model skips re-processing it.
Why it matters:
- Lower cost — cached tokens are typically 50–90% cheaper
- Faster responses — time to first token can drop dramatically
- Practical for RAG and agents — repeated context is the norm
When to use it:
- Long system prompts or instructions
- Document Q&A where many users ask different questions about the same doc
- Agent loops that re-read the same tool definitions or memory
- Multi-turn conversations
Supported by: Anthropic, OpenAI, Google, and most major providers, with slightly different APIs and pricing for cached vs fresh tokens.