Extended Thinking
A model mode where the LLM spends extra compute reasoning through a problem before answering — trading latency for quality on hard tasks.
In plain English
Extended Thinking is a feature of frontier models (Claude 4.x, OpenAI's o-series, Gemini Thinking) where the model generates a long internal chain of reasoning before producing the user-facing answer. The thinking is sometimes shown, sometimes hidden, but always paid for in tokens.
Why it helps: On hard problems — math, coding, multi-step reasoning — letting the model "think out loud" before committing to an answer dramatically improves accuracy. It's the practical application of test-time compute: more inference budget → better outputs on the same model.
Trade-offs:
- Latency — 5–60+ seconds before the first user-visible token
- Cost — every thinking token is billed
- Not always helpful — easy queries don't benefit; some get worse
How products expose it:
- Claude — "Extended thinking" toggle or budget parameter via the API
- ChatGPT — model picker (o3, o3-mini, GPT-5 thinking variants)
- Gemini — "Thinking" model variants
- Perplexity — "Pro" / "Reasoning" toggles
Most production apps reserve extended thinking for tasks where users explicitly expect a slower, better answer.