Benchmark
A standardised test used to compare AI models on specific tasks — like coding, maths, reasoning, or following instructions.
In plain English
A benchmark is a fixed dataset and scoring method used to measure how well an AI model performs on a particular task. Because every model faces the same questions and the same scoring rules, benchmarks give labs (and buyers) a common yardstick for comparing models.
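To make "fixed dataset plus scoring method" concrete, here is a minimal sketch in Python. Everything in it is an illustrative assumption: the three-question dataset, the exact-match scoring rule, and the `toy_model` function are invented for this example and do not come from any real benchmark.

```python
from typing import Callable

# Hypothetical toy dataset: a fixed list of (question, reference answer) pairs.
DATASET = [
    ("What is 2 + 3?", "5"),
    ("What is the capital of France?", "Paris"),
    ("How many days are in a week?", "7"),
]


def normalise(text: str) -> str:
    """Ignore case and surrounding whitespace when comparing answers."""
    return text.strip().lower()


def run_benchmark(model: Callable[[str], str]) -> float:
    """Score a model by exact-match accuracy over the fixed dataset."""
    correct = sum(normalise(model(q)) == normalise(a) for q, a in DATASET)
    return correct / len(DATASET)


def toy_model(question: str) -> str:
    """Stand-in for a real model: answers two of the three questions."""
    canned = {
        "What is 2 + 3?": "5",
        "What is the capital of France?": "paris",
    }
    return canned.get(question, "I don't know")


score = run_benchmark(toy_model)
print(f"accuracy: {score:.2f}")
```

Any other model can be passed to `run_benchmark` and is scored on the same questions in the same way, which is what makes the resulting numbers comparable.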
Common benchmarks:
- MMLU: general knowledge across 57 subjects
- HumanEval / SWE-bench: coding ability
- GSM8K / MATH: maths problem solving
- HellaSwag / ARC: reasoning and common sense
- MT-Bench / Chatbot Arena: open-ended chat quality, scored by an LLM judge (MT-Bench) or by human preference votes (Chatbot Arena)
Limits of benchmarks:
- Models can be trained on the test (data contamination), inflating scores
- High benchmark scores don't always translate to real-world usefulness
- Newer models tend to saturate older benchmarks quickly: once top scores cluster near the maximum, the benchmark no longer separates one model from another
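The data contamination point above can be illustrated with a simple overlap check. This is only a sketch under stated assumptions: the function names, the 5-word window, and the example strings are all hypothetical, and real contamination audits use larger corpora and more careful matching.

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the set of n-word sequences in the text (lower-cased)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_fraction(test_item: str, training_text: str, n: int = 5) -> float:
    """Fraction of the test item's n-grams that also appear in the training text."""
    item_grams = ngrams(test_item, n)
    if not item_grams:
        return 0.0
    return len(item_grams & ngrams(training_text, n)) / len(item_grams)


# Hypothetical training snippet that happens to contain one test question verbatim.
training_text = (
    "quiz answers: a farmer has 12 sheep and buys 7 more sheep "
    "at the market so he has 19"
)
leaked_item = "A farmer has 12 sheep and buys 7 more sheep at the market"
clean_item = "How many legs do three spiders have in total"

leaked_score = overlap_fraction(leaked_item, training_text)
clean_score = overlap_fraction(clean_item, training_text)
print(leaked_score, clean_score)
```

A high overlap for a test item suggests the model may have seen it during training, so a correct answer on that item says little about genuine ability.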
When comparing AI tools, ask which benchmarks underpin the model's claims — and how recent they are.