AI Evaluation
The structured process of measuring how well an AI model performs on dimensions such as accuracy, safety, cost, and latency, usually by running it against a fixed test set known as an eval.
In plain English
AI evaluation (or "evals") is how teams measure whether a model — or a change to a model, prompt, or pipeline — actually works. Without evals, you're guessing. With them, you can ship updates with confidence and catch regressions.
Two main flavours:
- Benchmark evals: public test sets such as MMLU, GPQA, and SWE-bench. Useful for comparing models in general, less useful for predicting how a model will behave in your specific app.
- Custom evals: a test set you build for your own use case, typically 50–500 representative inputs, each paired with an expected output or with judging criteria.
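To make the shape of a custom eval set concrete, here is a minimal sketch that stores one test case per line of JSON. The field names (`input`, `expected`, `rubric`) and the `EvalCase` class are assumptions made for illustration; they are not taken from any particular eval tool.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class EvalCase:
    """One test case; the field names are illustrative assumptions."""
    input: str
    expected: Optional[str] = None  # reference output, e.g. for exact-match scoring
    rubric: Optional[str] = None    # judging criteria for an LLM or human grader


def load_cases(jsonl_text: str) -> list[EvalCase]:
    """Parse one JSON object per non-empty line into EvalCase records."""
    cases = []
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        case = EvalCase(**json.loads(line))
        # Every case needs something to be graded against.
        if case.expected is None and case.rubric is None:
            raise ValueError(f"case needs an expected output or a rubric: {case.input!r}")
        cases.append(case)
    return cases


# Hypothetical sample data: one case with an expected output, one with a rubric.
SAMPLE = """
{"input": "What is 2 + 2?", "expected": "4"}
{"input": "Summarise the refund policy.", "rubric": "Mentions the 30-day window."}
"""

cases = load_cases(SAMPLE)
```

Keeping the set in a plain, versioned file like this makes it easy to rerun the same cases every time a model or prompt changes.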
How custom evals work:
- Collect real examples from production (good, bad, edge cases)
- Write or generate expected outputs, or grading rubrics
- Run candidate models / prompts against the set
- Score with exact-match, LLM-as-judge, or human review
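The steps above can be sketched as a small harness. This is a toy under stated assumptions: the eval set is hard-coded, scoring is exact-match only, and `baseline` and `candidate` are hypothetical stand-ins for real model or prompt variants (a real version would call a model instead of a lookup table).

```python
from typing import Callable

# Hypothetical eval set: (input, expected output) pairs.
EVAL_SET = [
    ("2 + 2", "4"),
    ("capital of France", "Paris"),
    ("3 * 3", "9"),
]


def exact_match(output: str, expected: str) -> bool:
    """Score one output, ignoring case and surrounding whitespace."""
    return output.strip().lower() == expected.strip().lower()


def run_eval(model: Callable[[str], str], eval_set=EVAL_SET) -> float:
    """Run a model over the set and return the fraction of cases it passes."""
    passed = sum(exact_match(model(prompt), expected) for prompt, expected in eval_set)
    return passed / len(eval_set)


# Stand-ins for two variants under comparison (assumptions, not real models).
def baseline(prompt: str) -> str:
    return {"2 + 2": "4", "capital of France": "paris "}.get(prompt, "")


def candidate(prompt: str) -> str:
    return {"2 + 2": "4", "capital of France": "Paris", "3 * 3": "9"}.get(prompt, "")


baseline_score = run_eval(baseline)
candidate_score = run_eval(candidate)
print(f"baseline={baseline_score:.2f} candidate={candidate_score:.2f}")
```

Comparing the two scores on the same fixed set is what makes a regression visible. For LLM-as-judge or human review, `exact_match` would be swapped for a different scoring function with the same inputs.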
Tools: LangSmith, Braintrust, Promptfoo, Weights & Biases, and DeepEval all offer eval infrastructure; Promptfoo and DeepEval are open source. Frontier labs (OpenAI, Anthropic, Google) are widely seen as treating eval design as a competitive moat.