Test-time Compute
The amount of compute spent at inference time on a single response. Reasoning models increase it substantially to improve answer quality.
In plain English
Test-time compute (sometimes called inference-time compute) is the budget of compute a model spends generating its answer. Historically this was small: the model wrote its answer directly, one forward pass per output token. Reasoning models changed that. They spend much more compute at test time (extended thinking, repeated sampling, search) and often get much better answers from the same underlying model.
Why it became a frontier: Around late 2024, frontier labs realised that, on reasoning-heavy tasks such as maths and coding, scaling test-time compute can improve results more than spending the same extra compute on pretraining. OpenAI's o1, Anthropic's extended thinking, Gemini's Thinking variants, and DeepSeek R1 all bet on this axis.
Techniques that consume test-time compute:
- Extended chain-of-thought — long internal reasoning before answering
- Best-of-N sampling — generate N answers, pick the best
- Tree search — explore multiple reasoning paths
- Self-consistency — sample many, vote on the most common answer
- Iterative refinement — generate, critique, revise
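Two of the techniques above, self-consistency and best-of-N, are simple enough to sketch in a few lines. This is a minimal illustration, not any lab's implementation: `sample` stands in for one call to a model, `score` stands in for a verifier or reward model, and both names are hypothetical. The noisy sampler at the end is a toy stand-in so the sketch runs without a model.

```python
import random
from collections import Counter
from typing import Callable


def self_consistency(sample: Callable[[], str], n: int) -> str:
    """Draw n answers and return the most common one (majority vote)."""
    answers = [sample() for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]


def best_of_n(sample: Callable[[], str], score: Callable[[str], float], n: int) -> str:
    """Draw n answers and return the one the scorer rates highest."""
    candidates = [sample() for _ in range(n)]
    return max(candidates, key=score)


def make_noisy_sampler(seed: int) -> Callable[[], str]:
    """Toy stand-in for a model: answers "42" about 60% of the time."""
    rng = random.Random(seed)

    def sample() -> str:
        if rng.random() < 0.6:
            return "42"
        return rng.choice(["41", "43", "24"])

    return sample


if __name__ == "__main__":
    sampler = make_noisy_sampler(seed=0)
    # One sample is a single draw from the noisy model; more samples
    # cost n times the compute but make the vote far more reliable.
    print("1 sample:   ", sampler())
    print("25 samples: ", self_consistency(sampler, 25))
```

Both functions cost roughly n times the compute of a single answer. The difference is the selection rule: self-consistency needs answers that can be compared for equality, while best-of-N needs a scorer you trust.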
Trade-off: More test-time compute generally means better quality but higher latency and cost. Many reasoning models expose this as a setting (for example an effort level of low/medium/high, or a thinking-token budget) so users can tune it per task.
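A back-of-the-envelope sketch of that trade-off, assuming cost grows linearly with generated tokens and that parallel samples add cost but not latency. Every number here is hypothetical: the token budgets per effort level, the price, and the decoding speed are illustrative placeholders, not figures from any provider.

```python
from dataclasses import dataclass

# Hypothetical thinking-token budgets per effort level (illustrative only).
EFFORT_BUDGETS = {"low": 1_000, "medium": 8_000, "high": 32_000}


@dataclass
class Estimate:
    output_tokens: int
    cost_usd: float
    latency_s: float


def estimate(
    effort: str,
    answer_tokens: int = 500,
    samples: int = 1,
    usd_per_million_tokens: float = 10.0,  # assumed price, not a real quote
    tokens_per_second: float = 50.0,       # assumed decoding speed
) -> Estimate:
    """Rough cost and latency for one request at a given effort level."""
    per_sample = EFFORT_BUDGETS[effort] + answer_tokens
    total = per_sample * samples
    cost = total * usd_per_million_tokens / 1_000_000
    # Samples are assumed to run in parallel, so latency is one sample's time.
    latency = per_sample / tokens_per_second
    return Estimate(total, cost, latency)


if __name__ == "__main__":
    for level in EFFORT_BUDGETS:
        print(level, estimate(level))
```

Under these assumptions, moving from low to high effort multiplies both cost and latency by the same factor, while adding parallel samples multiplies only cost.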