Training

Knowledge Distillation

Training a small "student" model to imitate a large "teacher" model — capturing most of the teacher’s capability at a fraction of the size and cost.

01 ——

In plain English

Knowledge distillation is the technique of using a large, expensive model to generate training data (or soft labels, meaning the teacher's probabilities over the possible answers) for a smaller, cheaper model. The small model learns to mimic the big one and can often match or come close to it on the target task while costing a fraction as much to run.
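As a toy illustration of that loop, the sketch below lets a stand-in "teacher" label unlabelled inputs with probabilities, then fits a tiny "student" to those soft labels. Everything here is an assumption made for illustration: `teacher_predict` is a hypothetical fixed formula rather than a real large model, the student is a two-weight logistic regression, and the data is random. It only shows the shape of the process, not how any production model is trained.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def teacher_predict(x):
    # Hypothetical stand-in for an expensive teacher model:
    # returns a probability (a soft label) for input x.
    return sigmoid(3.0 * x[0] - 2.0 * x[1] + 0.5)

def train_student(inputs, soft_labels, epochs=200, lr=0.5):
    """Fit a logistic-regression student to the teacher's soft labels
    using full-batch gradient descent on cross-entropy."""
    w, b = [0.0, 0.0], 0.0
    n = len(inputs)
    for _ in range(epochs):
        grad_w, grad_b = [0.0, 0.0], 0.0
        for x, y in zip(inputs, soft_labels):
            p = sigmoid(w[0] * x[0] + w[1] * x[1] + b)
            err = p - y  # gradient of cross-entropy w.r.t. the logit
            grad_w[0] += err * x[0]
            grad_w[1] += err * x[1]
            grad_b += err
        w[0] -= lr * grad_w[0] / n
        w[1] -= lr * grad_w[1] / n
        b -= lr * grad_b / n
    return w, b

random.seed(0)
# Step 1: collect unlabelled inputs and let the teacher label them.
unlabelled = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(200)]
soft_labels = [teacher_predict(x) for x in unlabelled]

# Step 2: train the student on the teacher's outputs only.
w, b = train_student(unlabelled, soft_labels)

def student_predict(x):
    return sigmoid(w[0] * x[0] + w[1] * x[1] + b)

# Step 3: check how often student and teacher agree on fresh inputs.
held_out = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(100)]
agreement = sum(
    (student_predict(x) > 0.5) == (teacher_predict(x) > 0.5) for x in held_out
) / len(held_out)
```

In this toy the student has the same form as the teacher, so high agreement is expected; with a real teacher the student is far smaller and usually recovers only part of the behaviour.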

Why teams distil:

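Inference cost — a 7B model typically costs many times less to serve than a 70B one (roughly 10× fewer parameters means roughly 10× less compute per token, and it fits on far less hardware)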
Latency — smaller models respond faster, important for real-time apps
Edge / on-device — phones and laptops can run distilled models locally
Specialisation — a small model fine-tuned on a narrow task often beats a generic big one on that task

Common patterns:

Teacher-student — the big model labels examples, the small one trains on them
Logit distillation — student learns to match the teacher's full probability distribution, not just the top answer
Task-specific distillation — distil only the capabilities you need
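To make the logit distillation pattern concrete, here is a minimal sketch of a commonly used loss: the KL divergence between the teacher's and the student's temperature-softened probability distributions. The function names, the example logits, and the temperature of 2.0 are all illustrative assumptions. Real training code would compute this per token inside a deep-learning framework and usually mix it with an ordinary hard-label loss.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn raw logits into probabilities, softened by a temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    Multiplying by temperature**2 is a common convention that keeps
    gradient magnitudes comparable across temperatures.
    """
    p = softmax(teacher_logits, temperature)  # teacher distribution
    q = softmax(student_logits, temperature)  # student distribution
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2

# Illustrative logits over three candidate answers (made-up numbers).
teacher = [4.0, 2.5, 0.5]
close_student = [3.8, 2.6, 0.4]  # ranks answers like the teacher
far_student = [0.5, 2.5, 4.0]    # ranks answers in the opposite order

loss_close = distillation_loss(close_student, teacher)
loss_far = distillation_loss(far_student, teacher)
```

The loss is zero when the student reproduces the teacher's distribution exactly and grows as the two diverge, so `loss_close` comes out smaller than `loss_far`. A higher temperature flattens both distributions, which exposes more of how the teacher ranks the less likely answers.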

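Where you see it: Many "Mini" / "Small" / "Flash" variants (GPT-4o-mini, Claude Haiku, Gemini Flash) are widely assumed to be at least partly distilled from larger siblings, although vendors rarely publish full training details. The DeepSeek R1 family popularised distilling reasoning capabilities into much smaller models, releasing distilled checkpoints built on Qwen and Llama base models.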

02 ——