Core concepts

Test-time Compute

The amount of compute spent at inference time on a single response — increased dramatically by reasoning models to improve quality.

01 ——

In plain English

Test-time compute (sometimes called inference-time compute) is the amount of compute a model spends generating its answer. Historically this was small: one forward pass per output token and nothing more. Reasoning models changed that. They spend much more compute at test time (extended thinking, repeated sampling, search) and often get markedly better answers from the same underlying model.

Why it became a frontier: From late 2024, frontier labs found that on reasoning-heavy tasks, scaling test-time compute can improve results more cheaply than scaling pretraining compute. OpenAI's o1, Anthropic's extended thinking for Claude, Gemini's Thinking variants, and DeepSeek R1 all bet on this axis.

Techniques that consume test-time compute:

Extended chain-of-thought — long internal reasoning before answering
Best-of-N sampling — generate N answers, pick the best
Tree search — explore multiple reasoning paths
Self-consistency — sample many, vote on the most common answer
Iterative refinement — generate, critique, revise
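
Two of the techniques above, self-consistency and best-of-N, are simple enough to sketch in a few lines. The following is a minimal illustration, not any lab's implementation. The `sample` callable stands in for one call to a model, `score` stands in for a verifier or reward model, and `make_toy_sampler` is a hypothetical stand-in that returns the right answer ("42") only part of the time.

```python
import random
from collections import Counter
from typing import Callable


def self_consistency(sample: Callable[[], str], n: int) -> str:
    """Draw n answers and return the most common one (majority vote)."""
    answers = [sample() for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]


def best_of_n(sample: Callable[[], str],
              score: Callable[[str], float],
              n: int) -> str:
    """Draw n answers and return the one the scorer rates highest."""
    candidates = [sample() for _ in range(n)]
    return max(candidates, key=score)


def make_toy_sampler(seed: int = 0, p_correct: float = 0.6) -> Callable[[], str]:
    """Hypothetical stand-in for a model call (an assumption for this sketch).

    Returns "42" with probability p_correct, otherwise one of a few
    wrong answers chosen uniformly.
    """
    rng = random.Random(seed)

    def sample() -> str:
        if rng.random() < p_correct:
            return "42"
        return rng.choice(["41", "43", "24"])

    return sample


if __name__ == "__main__":
    sampler = make_toy_sampler(seed=0)
    # One sample is a single draw; more samples cost more but vote out noise.
    print("1 sample:   ", sampler())
    print("25 samples: ", self_consistency(sampler, 25))
```

Both functions make n model calls instead of one, so compute grows linearly with n. The difference is the selection rule: self-consistency needs no scorer but only works when answers can be compared for equality, while best-of-N needs a scorer you trust.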

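Trade-off: More test-time compute generally buys better quality, but with diminishing returns and at the price of higher latency and cost. Many reasoning models expose this as a setting (for example an effort level such as low/medium/high, or a budget of thinking tokens) so users can tune it per task.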
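
The shape of this trade-off can be made concrete with a toy calculation for majority voting. This is a sketch under strong simplifying assumptions, not a measurement of any real model: each sample is independently correct with probability p, answers are simply right or wrong, and every sample costs the same hypothetical number of tokens.

```python
from math import comb


def majority_vote_accuracy(p: float, n: int) -> float:
    """Probability that most of n independent samples are correct.

    Assumes each sample is correct with probability p and that answers
    are binary (right or wrong). n must be odd so there are no ties.
    """
    if n % 2 == 0:
        raise ValueError("use an odd n to avoid ties")
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )


def total_tokens(n: int, tokens_per_sample: int) -> int:
    """Cost grows linearly with the number of samples."""
    return n * tokens_per_sample


# tokens_per_sample=500 is an arbitrary illustrative figure.
for n in (1, 3, 5, 9, 17):
    acc = majority_vote_accuracy(0.6, n)
    print(f"n={n:2d}  tokens={total_tokens(n, 500):5d}  accuracy={acc:.3f}")
```

Under these assumptions, cost rises linearly while each additional pair of samples adds less accuracy than the previous pair. The model also shows a limit of the approach: if a single sample is right less than half the time (p below 0.5), voting over more samples makes the result worse, not better.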

02 ——