Infra & cost

Streaming

Sending an AI model's response token-by-token as it's generated, so the user sees text appear immediately instead of waiting for the full reply.

01 ——

In plain English

Streaming is the technique of showing an AI model's response as it's being generated, roughly word by word (more precisely, token by token), instead of waiting for the entire response and showing it all at once. It's why ChatGPT feels fast even when full responses take 5–10 seconds.

Why streaming matters:

Perceived speed — users see something happening immediately
Early cancellation — users can stop a bad response mid-generation
Real-time use cases — voice agents, live translation, code completion
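To make early cancellation concrete, here is a minimal sketch in Python. It assumes the model can be represented by a lazy generator; `generate` and `consume_until` are hypothetical names for illustration, not part of any real SDK. The point is that once the consumer stops reading, no further tokens are produced.

```python
from typing import Iterator, List


def generate(tokens: List[str], produced: List[str]) -> Iterator[str]:
    """Hypothetical stand-in for a model: yields tokens lazily and
    records each one in `produced` at the moment it is generated."""
    for token in tokens:
        produced.append(token)
        yield token


def consume_until(stream: Iterator[str], stop_after: int) -> List[str]:
    """Show tokens as they arrive and cancel after `stop_after` of them."""
    shown: List[str] = []
    for token in stream:
        shown.append(token)
        if len(shown) >= stop_after:
            stream.close()  # tell the generator to stop; no more work is done
            break
    return shown


if __name__ == "__main__":
    produced: List[str] = []
    shown = consume_until(generate(["a", "b", "c", "d", "e"], produced), 2)
    print(shown, produced)
```

With a real API, cancelling usually means closing the HTTP connection or stream object; whether the provider stops generating (and billing) at that point depends on the provider.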

How it works: The AI model generates one token at a time. Streaming sends each token, or a small chunk of tokens, to the client as soon as it's produced, typically over Server-Sent Events (SSE) or WebSockets.
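The flow described above can be sketched without any network code. The example below is a simplified illustration, not any particular provider's wire format: `fake_model_tokens` is a hypothetical stand-in for a model, the JSON payload shape `{"token": ...}` is an assumption, and the `[DONE]` end marker is a convention some APIs use rather than part of the SSE standard. Only the `data: ...` line followed by a blank line comes from SSE itself.

```python
import json
from typing import Iterable, Iterator


def fake_model_tokens() -> Iterator[str]:
    """Hypothetical stand-in for a model that emits one token at a time."""
    for token in ["Stream", "ing", " feels", " fast", "."]:
        yield token


def to_sse_frames(tokens: Iterable[str]) -> Iterator[str]:
    """Server side: wrap each token in an SSE 'data:' frame as it is produced."""
    for token in tokens:
        yield "data: " + json.dumps({"token": token}) + "\n\n"
    # Assumed end-of-stream sentinel; not defined by the SSE standard.
    yield "data: [DONE]\n\n"


def read_sse_frames(frames: Iterable[str]) -> Iterator[str]:
    """Client side: pull the token out of each frame, stopping at the sentinel."""
    for frame in frames:
        if not frame.startswith("data: "):
            continue  # ignore anything that is not a data frame
        payload = frame[len("data: "):].strip()
        if payload == "[DONE]":
            return
        yield json.loads(payload)["token"]


def render_stream() -> str:
    """Append each token to the visible text as soon as it arrives."""
    shown = ""
    for token in read_sse_frames(to_sse_frames(fake_model_tokens())):
        shown += token  # a real UI would repaint here
    return shown


if __name__ == "__main__":
    print(render_stream())
```

Because every stage is a generator, each token passes from "model" to "screen" before the next one is produced, which is the property that makes streaming feel immediate.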

When NOT to stream: Some workflows need the full response before they can act on it, for example parsing a JSON payload or executing a tool call whose arguments must be complete. For those, streaming adds complexity without a UX benefit.
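A small sketch of why structured output resists streaming: a partial JSON string cannot be parsed, so the client has to buffer every chunk anyway. The chunk contents and the helper names `is_parseable` and `collect_json` are made up for illustration.

```python
import json
from typing import Iterable

# Hypothetical chunks of one JSON response, split at arbitrary points.
CHUNKS = ['{"city": "Par', 'is", "tem', 'p_c": 18}']


def is_parseable(text: str) -> bool:
    """Return True only if `text` is a complete, valid JSON document."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


def collect_json(chunks: Iterable[str]) -> dict:
    """Buffer all chunks, then parse once the response is complete."""
    return json.loads("".join(chunks))


if __name__ == "__main__":
    print(is_parseable(CHUNKS[0]))  # the first chunk alone is not valid JSON
    print(collect_json(CHUNKS))
```

If the result is only useful once parsed, requesting a non-streamed response gives the same outcome with less code.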

Most major AI APIs and SDKs support streaming, typically as an opt-in option on the request.

02 ——