Infra & cost

Streaming

Sending an AI model's response token-by-token as it's generated, so the user sees text appear immediately instead of waiting for the full reply.

01 ——

In plain English

Streaming means showing an AI model's response as it is generated, token by token (roughly word by word), instead of waiting for the entire response and showing it all at once. It's why ChatGPT feels fast even when full responses take 5–10 seconds.

Why streaming matters:

  • Perceived speed — users see something happening immediately
  • Early cancellation — users can stop a bad response mid-generation
  • Real-time use cases — voice agents, live translation, code completion

How it works: The AI model generates one token at a time. Streaming sends each token to the client as soon as it's produced, typically over Server-Sent Events (SSE) or WebSockets.
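
As an illustration of the flow described above, here is a minimal sketch in Python using only the standard library. It is not any particular vendor's API: fake_model, to_sse, and read_sse are hypothetical names, the token list is made up, and the "[DONE]" end marker is an assumed convention rather than part of the SSE format itself.

```python
import json
import time
from typing import Iterable, Iterator


def fake_model(prompt: str, delay: float = 0.0) -> Iterator[str]:
    """Hypothetical stand-in for a model: yields one token at a time."""
    for token in ["Stream", "ing", " feels", " fast", "."]:
        time.sleep(delay)  # simulates per-token generation time
        yield token


def to_sse(tokens: Iterable[str]) -> Iterator[str]:
    """Server side: wrap each token in an SSE frame as soon as it is produced."""
    for token in tokens:
        # JSON-encode the payload so a newline inside a token cannot break framing.
        yield f"data: {json.dumps({'token': token})}\n\n"
    yield "data: [DONE]\n\n"  # end-of-stream marker (assumed convention)


def read_sse(frames: Iterable[str]) -> Iterator[str]:
    """Client side: decode frames back into tokens, stopping at the end marker."""
    for frame in frames:
        payload = frame.removeprefix("data: ").strip()
        if payload == "[DONE]":
            return
        yield json.loads(payload)["token"]


if __name__ == "__main__":
    # Each token is printed as it arrives instead of after the full reply.
    for token in read_sse(to_sse(fake_model("hi", delay=0.05))):
        print(token, end="", flush=True)
    print()
```

Because every stage is a generator, nothing waits for the full reply: each token passes from model to frame to client before the next one is produced. In a real deployment the frames would travel over an HTTP response with the text/event-stream content type rather than an in-process iterator.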

When NOT to stream: Some workflows can't act until the response is complete, for example parsing a JSON object or executing a tool call taken from the model's output. For those, streaming adds complexity without a UX benefit.
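
To make the JSON case concrete, the sketch below (Python, standard library only) shows why partial output is not usable: a prefix of a JSON document generally does not parse, so the consumer has to buffer the whole stream anyway. The chunk boundaries and the names fake_json_stream, collect_json, and parses are hypothetical and chosen only for illustration.

```python
import json
from typing import Iterable, Iterator


def fake_json_stream() -> Iterator[str]:
    """Hypothetical token stream whose concatenation is one JSON object."""
    yield from ['{"city"', ': "Par', 'is", ', '"temp_c": ', '21}']


def parses(text: str) -> bool:
    """Return True if text is a complete, valid JSON document."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


def collect_json(chunks: Iterable[str]) -> dict:
    """Buffer every chunk, then parse once the stream has ended."""
    return json.loads("".join(chunks))


if __name__ == "__main__":
    buffer = ""
    for chunk in fake_json_stream():
        buffer += chunk
        # Only the final, complete buffer is valid JSON.
        print(f"{buffer!r:40} parses: {parses(buffer)}")
    print(collect_json(fake_json_stream()))
```

Since the result is only usable after the last chunk, requesting a non-streamed response here gives the same outcome with less code.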

Most major AI APIs and their SDKs support streaming, usually as an opt-in request option.


Vol. 4 · Issue 19 · Last reviewed 2026-05-30
