Training

AI Evaluation

The structured process of measuring how well an AI model performs on dimensions such as accuracy, safety, cost, and latency, usually by running it against a fixed test set with a defined scoring method (an "eval").

01 ——

In plain English

AI evaluation (often shortened to "evals") is how teams measure whether a model actually works, and whether a change to a model, prompt, or pipeline makes it better or worse. Without evals, you're guessing. With them, you can compare each change against a baseline score, ship updates with confidence, and catch regressions before users do.

Two main flavours:

  • Benchmark evals: public test sets such as MMLU, GPQA, and SWE-bench. Useful for comparing models in general, less so for predicting how one will perform in your specific app.
  • Custom evals: a test set you build for your own use case, typically 50–500 representative inputs with expected outputs or judging criteria.
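
As a rough illustration, a custom eval set can be stored as one record per input, each carrying either an expected output or judging criteria. The field names (`input`, `expected`, `rubric`), the `EvalCase` class, and the sample records below are assumptions made for this sketch, not a standard schema or any tool's format.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class EvalCase:
    """One test input plus how to judge the answer (hypothetical schema)."""
    input: str
    expected: Optional[str] = None  # reference answer, for exact-match scoring
    rubric: Optional[str] = None    # judging criteria, for LLM-as-judge or human review

    def __post_init__(self) -> None:
        # Every case needs at least one way to be graded.
        if self.expected is None and self.rubric is None:
            raise ValueError("case needs an expected output or a rubric")


def load_cases(jsonl_text: str) -> list[EvalCase]:
    """Parse one JSON object per non-empty line into EvalCase records."""
    return [
        EvalCase(**json.loads(line))
        for line in jsonl_text.splitlines()
        if line.strip()
    ]


# Made-up example records, one JSON object per line (JSONL).
SAMPLE = """
{"input": "What is the refund window?", "expected": "30 days"}
{"input": "Summarise this ticket politely.", "rubric": "Polite, under 50 words, names the issue"}
"""

cases = load_cases(SAMPLE)
print(f"loaded {len(cases)} cases")
```

Keeping the cases in a plain, versioned file like this makes it easier to rerun the same set whenever the model or prompt changes.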

How custom evals work:

  1. Collect real examples from production (good, bad, edge cases)
  2. Write or generate expected outputs, or grading rubrics
  3. Run candidate models / prompts against the set
  4. Score with exact-match, LLM-as-judge, or human review
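
Steps 3 and 4 can be sketched in a few lines, assuming exact-match scoring and two stand-in "models". Everything named here (`EVAL_SET`, `run_eval`, `baseline_model`, `candidate_model`) is hypothetical; in a real setup the stand-in functions would call an actual model or prompt, and the eval set would come from production examples.

```python
from typing import Callable

# Hypothetical eval set: (input, expected output) pairs.
EVAL_SET = [
    ("2 + 2", "4"),
    ("capital of France", "Paris"),
    ("3 * 3", "9"),
]


def exact_match(output: str, expected: str) -> bool:
    """Score one answer: equality ignoring case and surrounding whitespace."""
    return output.strip().lower() == expected.strip().lower()


def run_eval(model: Callable[[str], str], eval_set: list[tuple[str, str]]) -> float:
    """Run a model over every input and return the fraction scored correct."""
    passed = sum(exact_match(model(prompt), expected) for prompt, expected in eval_set)
    return passed / len(eval_set)


# Stand-ins for real model calls (assumed for illustration only).
def baseline_model(prompt: str) -> str:
    return {"2 + 2": "4", "capital of France": "paris"}.get(prompt, "unknown")


def candidate_model(prompt: str) -> str:
    return {"2 + 2": "4", "capital of France": "Paris ", "3 * 3": "9"}.get(prompt, "unknown")


scores = {
    "baseline": run_eval(baseline_model, EVAL_SET),
    "candidate": run_eval(candidate_model, EVAL_SET),
}
for name, score in scores.items():
    print(f"{name}: {score:.0%}")
```

Exact match only suits tasks with a single correct answer. For open-ended outputs, the `exact_match` function would be swapped for an LLM-as-judge call or a human review step that grades against the rubric.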

Tools: LangSmith, Braintrust, Weights & Biases, and the open-source Promptfoo and DeepEval all offer eval infrastructure. Frontier labs (OpenAI, Anthropic, Google) treat eval design as a competitive moat.

Vol. 4 · Issue 19 · Last reviewed 2026-05-30
