Training

Benchmark

A standardised test used to compare AI models on specific tasks — like coding, maths, reasoning, or following instructions.

01 ——

In plain English

A benchmark is a fixed dataset and scoring method used to measure how well an AI model performs on a particular task. Because every model faces the same questions and the same scoring, benchmarks let labs (and buyers) compare models on a like-for-like basis.
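To make "fixed dataset plus scoring method" concrete, here is a minimal sketch in Python. Everything in it is an illustrative assumption rather than part of any real benchmark: the three-question `DATASET`, the `toy_model` stand-in, and the choice of exact-match accuracy as the scoring rule.

```python
from typing import Callable, List, Tuple

# Hypothetical fixed dataset: (question, reference answer) pairs.
# A real benchmark would ship thousands of items; these are made up.
DATASET: List[Tuple[str, str]] = [
    ("2 + 2", "4"),
    ("Capital of France", "Paris"),
    ("3 * 5", "15"),
]


def normalise(text: str) -> str:
    """Strip whitespace and lowercase so trivial formatting differences don't count as errors."""
    return text.strip().lower()


def exact_match_accuracy(model: Callable[[str], str],
                         dataset: List[Tuple[str, str]]) -> float:
    """Scoring method: the fraction of questions answered exactly right."""
    correct = sum(normalise(model(q)) == normalise(ref) for q, ref in dataset)
    return correct / len(dataset)


def toy_model(question: str) -> str:
    """Stand-in for a real model (an assumption for this sketch): knows two of the three answers."""
    answers = {"2 + 2": "4", "Capital of France": "paris"}
    return answers.get(question, "unknown")


score = exact_match_accuracy(toy_model, DATASET)
print(f"accuracy: {score:.2f}")
```

Any other model can be passed to `exact_match_accuracy` with the same `DATASET`, which is what makes the resulting scores comparable.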

Common benchmarks:

  • MMLU — general knowledge across 57 subjects
  • HumanEval / SWE-bench — coding ability
  • GSM8K / MATH — maths problem solving
  • HellaSwag / ARC — reasoning and common sense
  • MT-Bench / Arena — open-ended chat quality, scored by an LLM judge (MT-Bench) or by human preference votes (Arena)

Limits of benchmarks:

  • Test questions can leak into a model's training data (data contamination), inflating scores
  • High benchmark scores don't always translate to real-world usefulness
  • Newer models tend to saturate older benchmarks within months, so top scores bunch near the maximum and stop separating models
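One common way to look for data contamination is to check whether word sequences from a test item also appear in the training text. The sketch below is a simplified illustration under stated assumptions: the 5-word window, the whitespace tokenisation, and all three example strings are invented for this example and are not taken from any real benchmark or lab's method.

```python
from typing import Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Return the set of n-word sequences in the text (lowercased, split on whitespace)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(test_item: str, training_text: str, n: int = 5) -> float:
    """Fraction of the test item's n-grams that also occur in the training text."""
    item_grams = ngrams(test_item, n)
    if not item_grams:
        return 0.0  # item shorter than n words: nothing to compare
    return len(item_grams & ngrams(training_text, n)) / len(item_grams)


# Hypothetical training text that happens to quote a test question verbatim.
training_text = ("quiz answers thread a farmer has 12 sheep and buys 7 more "
                 "how many sheep now answer 19")
leaked_item = "a farmer has 12 sheep and buys 7 more how many sheep now"
fresh_item = "a train leaves at noon travelling at 60 miles per hour"

print(overlap_fraction(leaked_item, training_text))  # high overlap suggests leakage
print(overlap_fraction(fresh_item, training_text))   # low overlap suggests a clean item
```

A high overlap is a warning sign rather than proof, and paraphrased or translated copies of a test item would slip past a check this simple.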

When comparing AI tools, ask which benchmarks underpin the model's claims — and how recent they are.

02 ——

Related terms

Vol. 4 · Issue 19 · Last reviewed 2026-05-30
