Latency
The time it takes an AI model to respond to a request — from when you hit send to when the first or final word appears.
In plain English
Latency is how long an AI model takes to produce a response. It's measured in two key ways:
- Time to first token (TTFT) — how quickly the model starts responding
- Total response time — how long until the answer is fully generated
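The two measurements above can be taken from any streamed response by timing the first and last chunk. The following is a minimal sketch, not a definitive implementation: `measure_latency` and `fake_stream` are hypothetical names, and `fake_stream` only simulates a model with sleeps so the example runs without a real API.

```python
import time
from typing import Callable, Iterable, Iterator, List, Tuple


def measure_latency(
    stream: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, float, str]:
    """Consume a token stream and return (ttft_seconds, total_seconds, text)."""
    start = clock()                  # the moment the request is "sent"
    ttft = None
    chunks: List[str] = []
    for chunk in stream:
        if ttft is None:
            ttft = clock() - start   # first token has arrived
        chunks.append(chunk)
    total = clock() - start          # last token has arrived
    if ttft is None:                 # empty response: no first token was seen
        ttft = total
    return ttft, total, "".join(chunks)


def fake_stream(
    tokens: List[str],
    first_delay: float = 0.05,
    per_token_delay: float = 0.01,
) -> Iterator[str]:
    """Hypothetical stand-in for a streaming model API (not a real client)."""
    time.sleep(first_delay)              # simulated wait before the first token
    for i, token in enumerate(tokens):
        if i > 0:
            time.sleep(per_token_delay)  # simulated per-token generation time
        yield token


ttft, total, text = measure_latency(fake_stream(["Hello", ",", " world"]))
print(f"TTFT: {ttft * 1000:.0f} ms, total: {total * 1000:.0f} ms, text: {text!r}")
```

With a real provider, the same function should work on whatever iterator of text chunks its streaming client returns, provided the request is only sent once iteration begins. If the client sends the request earlier, start the clock before making the call.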
Why latency matters:
- User experience — a 5-second delay feels broken; 200ms feels instant
- Cost of waiting — in a multi-step agent, a few seconds per step adds up to minutes per task
- Use case fit — real-time voice needs <500ms; batch summarisation can take longer
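The point about agents is simple arithmetic, shown here as a short sketch. The step count and per-step time are made-up illustrative numbers, and `agent_wall_time` is a hypothetical helper that assumes the steps run one after another.

```python
def agent_wall_time(steps: int, seconds_per_step: float) -> float:
    """Total seconds for an agent that makes its model calls one after another."""
    return steps * seconds_per_step


# Hypothetical task: 30 sequential model calls at 4 seconds each.
total_seconds = agent_wall_time(steps=30, seconds_per_step=4.0)
print(f"{total_seconds:.0f} s = {total_seconds / 60:.1f} min")  # prints "120 s = 2.0 min"
```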
What drives latency:
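- Model size — larger models do more computation per token, so they are generally slower
- Output length — tokens are generated one at a time, so total response time grows with the number of output tokens
- Prompt size — the whole prompt is processed before the first token appears, so long context mainly delays the start of the response
- Provider load — shared endpoints tend to slow down when demand is high, such as at peak hours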
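These drivers can be combined into a rough back-of-envelope estimate. The sketch below is a deliberate simplification, not how any particular provider computes latency. It assumes the prompt is processed at one fixed rate and output tokens are generated at another, with model size reflected in both rates and provider load reduced to a single queueing delay. All names and numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LatencyEstimate:
    ttft_seconds: float    # estimated time to first token
    total_seconds: float   # estimated time until the answer is complete


def estimate_latency(
    prompt_tokens: int,
    output_tokens: int,
    prefill_tokens_per_s: float,   # prompt processing speed (assumed constant)
    decode_tokens_per_s: float,    # output generation speed (assumed constant)
    queue_seconds: float = 0.0,    # waiting time caused by provider load
) -> LatencyEstimate:
    """Rough estimate: queueing + prompt processing, then token-by-token output."""
    ttft = queue_seconds + prompt_tokens / prefill_tokens_per_s
    total = ttft + output_tokens / decode_tokens_per_s
    return LatencyEstimate(ttft_seconds=ttft, total_seconds=total)


# Illustrative numbers only: a 2,000-token prompt and a 300-token answer.
estimate = estimate_latency(
    prompt_tokens=2000,
    output_tokens=300,
    prefill_tokens_per_s=5000.0,
    decode_tokens_per_s=50.0,
    queue_seconds=0.1,
)
print(f"TTFT ~{estimate.ttft_seconds:.2f} s, total ~{estimate.total_seconds:.2f} s")
```

In this example the output accounts for about 6 of the roughly 6.5 seconds, which is why output length usually dominates total response time while prompt size and load mostly affect the wait before the first token.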
Common mitigations: smaller distilled models, streaming responses, prompt caching, and dedicated inference infrastructure.