Infra & cost

Latency

The time it takes an AI model to respond to a request — from when you hit send to when the first or final word appears.

01 ——

In plain English

Latency is how long an AI model takes to produce a response. It's measured in two key ways:

  • Time to first token (TTFT) — how quickly the model starts responding
  • Total response time — how long until the answer is fully generated
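
The two measurements can be taken from any streamed response by timestamping the first token and the end of the stream. The sketch below is a minimal illustration, not a specific provider's API: `measure_stream` and `fake_stream` are hypothetical names, and the sleeps only simulate a model.

```python
import time
from typing import Iterable, Iterator


def measure_stream(tokens: Iterable[str]) -> dict:
    """Consume a token stream and record TTFT and total response time in seconds."""
    start = time.perf_counter()
    ttft = None
    pieces = []
    for token in tokens:
        if ttft is None:
            # The first token has arrived: this is the time to first token.
            ttft = time.perf_counter() - start
        pieces.append(token)
    # The stream is exhausted: this is the total response time.
    total = time.perf_counter() - start
    return {"ttft_s": ttft, "total_s": total, "text": "".join(pieces)}


def fake_stream() -> Iterator[str]:
    """Hypothetical stand-in for a streaming model response."""
    time.sleep(0.05)  # simulated wait before the first token
    for word in ["Latency ", "is ", "waiting."]:
        yield word
        time.sleep(0.01)  # simulated gap between tokens


if __name__ == "__main__":
    result = measure_stream(fake_stream())
    print(f"TTFT: {result['ttft_s']:.3f}s, total: {result['total_s']:.3f}s")
```

In practice `fake_stream()` would be replaced by whatever iterator a streaming client returns. The timer starts just before the stream is consumed, so any network round trip that happens inside the iterator is included in TTFT.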

Why latency matters:

  • User experience — a 5-second delay feels broken; 200ms feels instant
  • Cost of waiting — delays compound: an agent that spends a few seconds on each step can take minutes to finish a task
  • Use case fit — real-time voice needs <500ms; batch summarisation can take longer
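
To make the compounding effect concrete, here is a small worked example. It assumes the agent runs its steps strictly one after another; the function name and the numbers are illustrative, not measurements.

```python
def agent_wall_time(steps: int, seconds_per_step: float) -> float:
    """Total wait for an agent that runs its model calls one after another."""
    return steps * seconds_per_step


# Illustrative numbers only: 20 sequential steps at 3 seconds each.
total = agent_wall_time(steps=20, seconds_per_step=3.0)
print(f"{total:.0f} s, or {total / 60:.0f} min")  # prints "60 s, or 1 min"
```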

What drives latency:

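  • Model size — larger models generally take longer to generate each token on the same hardware
  • Output length — tokens are generated one at a time, so longer answers take longer to finish
  • Prompt size — the model must process the whole prompt before the first token, so long context mainly raises TTFT
  • Provider load — shared endpoints often respond more slowly when demand peaks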
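
These drivers can be combined into a rough first-order estimate. The sketch below is a simplification under stated assumptions: it treats prompt processing and token generation as constant rates, folds provider load into a single queueing delay, and ignores network time. `estimate_latency` and all the rates are made up for illustration.

```python
def estimate_latency(
    prompt_tokens: int,
    output_tokens: int,
    prefill_tokens_per_s: float,
    decode_tokens_per_s: float,
    queue_s: float = 0.0,
) -> dict:
    """First-order latency estimate; ignores network time and batching effects."""
    # Prompt size: the whole prompt is processed before any output appears.
    ttft = queue_s + prompt_tokens / prefill_tokens_per_s
    # Output length: tokens are generated one after another.
    total = ttft + output_tokens / decode_tokens_per_s
    return {"ttft_s": ttft, "total_s": total}


# Made-up rates for illustration, not measurements of any real model.
# Model size shows up as lower rates for the larger model.
small = estimate_latency(2000, 300, prefill_tokens_per_s=10000, decode_tokens_per_s=100)
large = estimate_latency(2000, 300, prefill_tokens_per_s=4000, decode_tokens_per_s=40)
print(small)  # {'ttft_s': 0.2, 'total_s': 3.2}
print(large)  # {'ttft_s': 0.5, 'total_s': 8.0}
```

Under these assumptions, output length dominates total response time while prompt size and queueing dominate TTFT, which is why the two metrics are worth tracking separately.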

Common mitigations: smaller distilled models, streaming responses, prompt caching, and dedicated inference infrastructure.


Vol. 4 · Issue 19 · Last reviewed 2026-05-30
