Training

Knowledge Distillation

Training a small "student" model to imitate a large "teacher" model — capturing most of the teacher’s capability at a fraction of the size and cost.

01 ——

In plain English

Knowledge distillation is the technique of using a large, expensive model to generate training data (or soft labels) for a smaller, cheaper model. The small model learns to mimic the big one and can often match it on the target task while costing a fraction to run.
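The data-generation idea above can be sketched in a few lines. This is only an illustration under stated assumptions: `build_distillation_set` and `toy_teacher` are hypothetical names, and the keyword rule stands in for a real (and much more expensive) teacher model.

```python
from typing import Callable, List, Tuple


def build_distillation_set(
    teacher: Callable[[str], str],
    unlabeled_inputs: List[str],
) -> List[Tuple[str, str]]:
    """Pair each unlabeled input with the teacher's output.

    The resulting (input, teacher_output) pairs become the student's
    supervised training set.
    """
    return [(text, teacher(text)) for text in unlabeled_inputs]


# Hypothetical stand-in for a large teacher model: a trivial keyword rule.
def toy_teacher(text: str) -> str:
    return "positive" if "good" in text.lower() else "negative"


inputs = ["Good value for money", "Arrived broken", "Really good support"]
train_set = build_distillation_set(toy_teacher, inputs)

for text, label in train_set:
    print(f"{label:8s} <- {text}")
```

In practice the teacher call would be a model inference, and the student would then be fine-tuned on `train_set` with an ordinary supervised objective.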

Why teams distil:

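  • Inference cost — a 7B model has about a tenth of the parameters of a 70B, so it is typically around an order of magnitude cheaper to serve, depending on hardware and batching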
  • Latency — smaller models respond faster, important for real-time apps
  • Edge / on-device — phones and laptops can run distilled models locally
  • Specialisation — a small model fine-tuned on a narrow task often beats a generic big one

Common patterns:

  • Teacher-student — the big model labels examples, the small one trains on them
  • Logit distillation — the student learns to match the teacher's full probability distribution, not just the top answer
  • Task-specific distillation — distil only the capabilities you need
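To make the logit distillation pattern concrete, here is a minimal sketch of one common formulation of the loss: a temperature-softened KL term between teacher and student distributions, blended with ordinary cross-entropy on the true label. The `temperature` and `alpha` values, the function names, and the example logits are illustrative assumptions, not settings taken from any particular model.

```python
import math
from typing import List


def log_softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable log-softmax of logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    log_total = m + math.log(sum(math.exp(z - m) for z in scaled))
    return [z - log_total for z in scaled]


def distillation_loss(
    student_logits: List[float],
    teacher_logits: List[float],
    true_label: int,
    temperature: float = 2.0,  # assumed value; higher means softer distributions
    alpha: float = 0.5,        # assumed weight on the soft (teacher) term
) -> float:
    # Soft term: KL(teacher || student) on temperature-softened distributions.
    log_p = log_softmax(teacher_logits, temperature)
    log_q = log_softmax(student_logits, temperature)
    soft = sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    # Scaling by T^2 keeps the soft term's gradient magnitude comparable
    # across temperatures.
    soft *= temperature ** 2
    # Hard term: cross-entropy against the ground-truth label at T = 1.
    hard = -log_softmax(student_logits)[true_label]
    return alpha * soft + (1 - alpha) * hard


teacher_logits = [4.0, 1.0, 0.5]
close_student = [3.5, 1.2, 0.4]   # roughly agrees with the teacher
far_student = [0.5, 3.0, 1.0]     # prefers a different class

loss_close = distillation_loss(close_student, teacher_logits, true_label=0)
loss_far = distillation_loss(far_student, teacher_logits, true_label=0)
print(f"close student: {loss_close:.4f}")
print(f"far student:   {loss_far:.4f}")
```

The student that tracks the teacher's distribution gets the lower loss. Setting `alpha=1.0` trains on the teacher's soft labels alone, which is useful when no ground-truth labels exist.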

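Where you see it: Many "Mini" / "Small" / "Flash" variants (GPT-4o-mini, Claude Haiku, Gemini Flash) are widely believed to be at least partly distilled from larger siblings, although vendors rarely publish full training details. The DeepSeek R1 release popularised distilling reasoning capabilities into much smaller models by publishing distilled checkpoints built on open Qwen and Llama base models.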

02 ——

Related terms

Vol. 4 · Issue 19 · Last reviewed 2026-05-30
