Training

Benchmark

A standardised test used to compare AI models on specific tasks — like coding, maths, reasoning, or following instructions.

01 ——

In plain English

A benchmark is a fixed dataset and scoring method used to measure how well an AI model performs on a particular task. Because every model faces the same questions and the same scoring rules, benchmarks give labs (and buyers) a like-for-like way to compare models.
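To make "fixed dataset and scoring method" concrete, here is a minimal sketch in Python. The three-question dataset, the exact-match scoring rule, and the `toy_model` function are all made up for illustration. Real benchmarks are far larger and often use more forgiving scoring.

```python
from typing import Callable

# Hypothetical three-item dataset; real benchmarks have hundreds or thousands of items.
DATASET = [
    {"question": "What is 2 + 3?", "answer": "5"},
    {"question": "What is 10 - 4?", "answer": "6"},
    {"question": "What is 3 * 3?", "answer": "9"},
]


def exact_match(prediction: str, reference: str) -> bool:
    """Scoring rule: correct only if the normalised strings are identical."""
    return prediction.strip().lower() == reference.strip().lower()


def run_benchmark(model: Callable[[str], str], dataset: list[dict]) -> float:
    """Ask the model every question and return the fraction answered correctly."""
    correct = sum(
        exact_match(model(item["question"]), item["answer"]) for item in dataset
    )
    return correct / len(dataset)


def toy_model(question: str) -> str:
    """Stand-in for a real model (an assumption for this sketch): always answers "5"."""
    return "5"


score = run_benchmark(toy_model, DATASET)
print(f"accuracy: {score:.2f}")  # prints "accuracy: 0.33"
```

Any other model can be passed to `run_benchmark` and scored against the same questions under the same rule, which is what makes the resulting numbers comparable.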

Common benchmarks:

MMLU — general knowledge across 57 subjects
HumanEval / SWE-bench — coding ability
GSM8K / MATH — maths problem solving
HellaSwag / ARC — reasoning and common sense
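MT-Bench / Chatbot Arena — open-ended chat quality; MT-Bench answers are graded by a strong LLM acting as judge, Chatbot Arena rankings come from human preference votes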

Limits of benchmarks:

Models can end up trained on the test questions, for example when they leak into web-scraped training data (data contamination), which inflates scores
High benchmark scores don't always translate to real-world usefulness
Newer models tend to saturate older benchmarks: once top scores cluster near the maximum, the benchmark no longer separates the best models
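One rough way to look for data contamination is to check whether word sequences from a test question also appear in the training text. The sketch below is a simplified illustration, not the procedure any particular lab uses. The function names, the 5-word window, and the example strings are all assumptions.

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the set of n-word sequences in the text (lowercased, split on whitespace)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_fraction(test_item: str, training_text: str, n: int = 5) -> float:
    """Fraction of the test item's n-grams that also occur in the training text."""
    item_grams = ngrams(test_item, n)
    if not item_grams:
        return 0.0  # item is shorter than n words, so there is nothing to compare
    return len(item_grams & ngrams(training_text, n)) / len(item_grams)


# Hypothetical training text that happens to contain one test question verbatim.
training_text = (
    "quiz answers: a farmer has 12 sheep and buys 7 more sheep "
    "so the total is 19"
)
leaked_item = "A farmer has 12 sheep and buys 7 more sheep"
clean_item = "A baker sells 30 loaves and bakes 15 more loaves"

print(overlap_fraction(leaked_item, training_text))  # prints 1.0
print(overlap_fraction(clean_item, training_text))   # prints 0.0
```

A high overlap suggests the model may have seen the question during training, so a correct answer says less about its ability. Exact word matching like this misses paraphrased or translated leaks, so a low score is not proof of a clean benchmark.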

When comparing AI tools, ask which benchmarks underpin the model's claims — and how recent they are.

02 ——

Related terms

LLM

Large Language Model — the type of AI behind tools like ChatGPT and Claude, trained to understand and generate text.

Foundation Model

A large, general-purpose AI model trained on broad data that can be adapted (via prompting or fine-tuning) to many downstream tasks.

Fine-tuning

Further training a pre-trained AI model on your own data to specialise it for a specific task or style.

Inference

The process of running a trained AI model to generate a response — as opposed to training the model.

Last reviewed June 2026
