Core concepts

Test-time Compute

The amount of compute spent at inference time on a single response — increased dramatically by reasoning models to improve quality.

01 ——

In plain English

Test-time compute (sometimes called inference-time compute) is the compute a model spends generating its answer. Historically this was small and fixed: one forward pass per output token. Reasoning models changed that by spending far more compute at test time (extended thinking, repeated sampling, search) to get better answers from the same underlying model.

Why it became a frontier: Around late 2024, frontier labs found that on reasoning-heavy tasks, extra test-time compute can improve results more cheaply than extra pretraining compute. OpenAI's o1, Anthropic's extended thinking, Gemini's Thinking variants, and DeepSeek R1 all bet on this axis.

Techniques that consume test-time compute:

  • Extended chain-of-thought — long internal reasoning before answering
  • Best-of-N sampling — generate N answers, pick the best
  • Tree search — explore multiple reasoning paths
  • Self-consistency — sample many, vote on the most common answer
  • Iterative refinement — generate, critique, revise
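
Two of the techniques above, self-consistency and best-of-N, are simple enough to sketch in a few lines of Python. This is a minimal illustration, not any lab's implementation: `toy_sampler` is a hypothetical stand-in for one model call, and the `score` function passed to `best_of_n` stands in for whatever verifier or reward model would rank the answers.

```python
import random
from collections import Counter
from typing import Callable


def self_consistency(sample: Callable[[], str], n: int) -> str:
    """Sample n answers and return the most common one (majority vote)."""
    answers = [sample() for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]


def best_of_n(sample: Callable[[], str], score: Callable[[str], float], n: int) -> str:
    """Sample n answers and return the one the scorer rates highest."""
    candidates = [sample() for _ in range(n)]
    return max(candidates, key=score)


def toy_sampler(rng: random.Random) -> str:
    """Hypothetical model call: usually answers '42', sometimes drifts."""
    return "42" if rng.random() < 0.6 else rng.choice(["41", "43"])


if __name__ == "__main__":
    rng = random.Random(0)
    # Each extra sample is one more model call, so compute grows linearly with n.
    print(self_consistency(lambda: toy_sampler(rng), n=25))
```

Both functions cost n model calls instead of one. The difference lies in how the winner is chosen: self-consistency needs only answers that can be compared for equality, while best-of-N needs a separate scorer.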

Trade-off: More test-time compute generally means better quality, but also higher latency and cost. Many modern reasoning models expose this as a setting (for example low/medium/high effort, or a thinking-token budget) so users can tune it per task.
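
To make the trade-off concrete, here is a back-of-the-envelope estimator. Every number in it is an assumption chosen for illustration: the token budgets per effort level, the price per million tokens, and the decoding speed are hypothetical and do not describe any real model or vendor.

```python
from dataclasses import dataclass

# Hypothetical thinking budgets per effort level (illustrative only).
THINKING_BUDGET_TOKENS = {"low": 1_000, "medium": 8_000, "high": 32_000}


@dataclass
class Estimate:
    total_tokens: int
    cost_usd: float
    latency_s: float


def estimate(
    effort: str,
    answer_tokens: int = 500,
    samples: int = 1,
    usd_per_million_tokens: float = 10.0,  # assumed price
    tokens_per_second: float = 50.0,       # assumed decoding speed
) -> Estimate:
    """Rough cost and latency for one request at a given effort level."""
    per_sample = THINKING_BUDGET_TOKENS[effort] + answer_tokens
    total = per_sample * samples
    cost = total * usd_per_million_tokens / 1_000_000
    # Samples are assumed to run in parallel, so latency tracks a single sample.
    latency = per_sample / tokens_per_second
    return Estimate(total, cost, latency)


if __name__ == "__main__":
    for level in THINKING_BUDGET_TOKENS:
        print(level, estimate(level))
```

Under these assumptions, cost grows with both the thinking budget and the number of samples, while latency depends only on the length of one sample. That is why parallel sampling is often the cheaper way to buy quality when response time matters.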

02 ——

Related terms

Vol. 4 · Issue 19 · Last reviewed 2026-05-30
