Knowledge Distillation
Training a small "student" model to imitate a large "teacher" model — capturing most of the teacher’s capability at a fraction of the size and cost.
In plain English
Knowledge distillation uses a large, expensive model to generate training data (hard labels, full responses, or soft probability labels) for a smaller, cheaper model. The small model learns to mimic the big one and can often come close to it on the target task while costing a fraction as much to run.
Why teams distil:
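- Inference cost — a 7B model has a tenth of the parameters of a 70B one, so it is typically around an order of magnitude cheaper to serve (the exact ratio depends on hardware, batching and quantisation)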
- Latency — smaller models respond faster, important for real-time apps
- Edge / on-device — phones and laptops can run distilled models locally
- Specialisation — a small model fine-tuned on a narrow task often beats a generic big one
Common patterns:
- Teacher-student (hard-label) — the big model labels examples or writes full responses, and the small one trains on them as if they were ground truth
- Logit distillation — the student learns to match the teacher's full probability distribution, not just the top answer
- Task-specific distillation — distil only the capabilities you need
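To make the difference between hard labels and logit matching concrete, here is a minimal sketch of a distillation loss in plain Python. It follows the commonly used formulation in which a temperature-softened KL term (scaled by the temperature squared) is blended with an ordinary cross-entropy term. The function names, the `temperature` and `alpha` parameters, and the toy logits are all illustrative assumptions, not taken from any particular library or training recipe.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; a higher temperature flattens the distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(student_logits, teacher_logits, hard_label,
                      temperature=2.0, alpha=0.5):
    """Blend a soft (logit-matching) term with a hard-label term.

    alpha=1.0 is pure logit distillation; alpha=0.0 is plain training
    on the label (hypothetical parameter names, chosen for this sketch).
    """
    # Soft term: how far the student's softened distribution is from the teacher's.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    soft = kl_divergence(p_teacher, p_student) * temperature ** 2

    # Hard term: cross-entropy of the student against a single label.
    hard = -math.log(softmax(student_logits)[hard_label])

    return alpha * soft + (1 - alpha) * hard

# Toy example with three classes; the label is the teacher's top answer.
teacher_logits = [4.0, 2.5, 0.5]
label = teacher_logits.index(max(teacher_logits))

good_student = [3.8, 2.4, 0.6]   # close to the teacher
poor_student = [0.5, 2.5, 4.0]   # ranks the classes in reverse

loss_good = distillation_loss(good_student, teacher_logits, label)
loss_poor = distillation_loss(poor_student, teacher_logits, label)
print(f"good student loss: {loss_good:.4f}")
print(f"poor student loss: {loss_poor:.4f}")
```

Setting `alpha=0.0` with the teacher's top answer as the label reduces this to the hard-label pattern, while `alpha=1.0` trains purely on the teacher's distribution. In practice the same idea is usually written with a tensor library and averaged over batches and token positions.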
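Where you see it: Many "Mini" / "Small" / "Flash" variants (GPT-4o-mini, Claude Haiku, Gemini Flash) are widely believed to be at least partly distilled from larger siblings, though vendors rarely publish full training details. The DeepSeek R1 family popularised distilling reasoning capabilities into much smaller models by fine-tuning smaller open models on outputs generated by R1.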