Mixture of Experts
A model architecture that has many "expert" subnetworks but activates only a few per token — getting big-model quality at small-model inference cost.
In plain English
Mixture of Experts (MoE) is a neural network design in which each MoE layer holds many specialised sub-networks ("experts"), and a small routing layer picks only a few of them to run on any given token. The result: you get the parameter count of a giant model with the per-token compute of a much smaller one.
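To make the routing step concrete, here is a minimal sketch of top-k routing for a single token in plain Python. It assumes the router has already produced one score (logit) per expert, keeps the k highest-scoring experts, and renormalises their weights to sum to 1, which is one common choice rather than the only one. The names `route_token` and `moe_layer`, and the toy experts, are illustrative and not taken from any particular library.

```python
import math

def softmax(xs):
    # Subtract the max for numerical stability before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(router_logits, k=2):
    """Pick the k highest-scoring experts and renormalise their weights."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

def moe_layer(token, router_logits, experts, k=2):
    """Weighted sum of the outputs of the selected experts only."""
    out = [0.0] * len(token)
    for idx, weight in route_token(router_logits, k):
        expert_out = experts[idx](token)  # unselected experts never run
        out = [o + weight * e for o, e in zip(out, expert_out)]
    return out

# Toy experts (assumption): each one just scales the token by a fixed factor.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
token = [1.0, -1.0]
router_logits = [0.1, 2.0, 0.3, 1.5]  # in a real model, from a learned linear layer

chosen = route_token(router_logits, k=2)
output = moe_layer(token, router_logits, experts, k=2)
print(chosen)
print(output)
```

Only two of the four experts are evaluated for this token, and that is where the compute saving comes from. All four still have to exist in memory.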
Why MoE wins on inference economics:
- A "1 trillion parameter" MoE might only activate 50B parameters per token
- Training is still expensive (every expert has to be trained and held in memory), but at scale serving cost tends to dominate
- Higher quality per compute dollar than a dense model of the same active size
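The arithmetic behind the "1 trillion total, 50B active" example can be sketched as below. It relies on two rough assumptions that are not from the text above: about 2 FLOPs per active parameter per token for a forward pass (a common rule of thumb), and 2 bytes per parameter for 16-bit weights. The function name `moe_serving_estimate` is hypothetical.

```python
def moe_serving_estimate(total_params, active_params, bytes_per_param=2):
    """Back-of-envelope serving numbers for an MoE model.

    Assumptions: ~2 FLOPs per active parameter per token (forward pass),
    and every parameter, active or not, must be held in memory.
    """
    return {
        "active_fraction": active_params / total_params,
        "flops_per_token": 2 * active_params,
        "weight_memory_gb": total_params * bytes_per_param / 1e9,
    }

# The example from the list above: 1 trillion total, 50B active per token.
est = moe_serving_estimate(total_params=1e12, active_params=50e9)
dense_flops = 2 * 1e12  # a dense model of the same total size, same rule of thumb

print(est)
print("compute vs. dense of same total size:", est["flops_per_token"] / dense_flops)
```

Under these assumptions the MoE does about 5% of the per-token compute of a dense model with the same parameter count, while the weights alone still take roughly 2,000 GB at 16-bit precision.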
Notable MoE models:
- Mixtral (Mistral) — popularised open-source MoE
- DeepSeek-V3 / R1 — large MoE that competes with frontier closed models
- Qwen-MoE (Alibaba)
- Grok-1 (xAI), released with open weights as a 314B-parameter MoE
- GPT-4 is widely believed to be MoE (never confirmed by OpenAI)
Trade-offs:
- More complex to train and fine-tune
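- Memory-heavy at inference: every expert must stay loaded, so memory scales with total parameters even though FLOPs scale with active parameters
- Routing imbalance can leave some experts under-trained, which is why training setups often add a load-balancing term to the loss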
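Routing imbalance is easy to measure. The sketch below computes the fraction of tokens sent to each expert and an auxiliary balance score in the style of the Switch Transformer loss (number of experts times the sum over experts of token fraction times mean router probability). It assumes top-1 routing for simplicity; the function name and toy inputs are illustrative, and real training code would also scale this term by a small coefficient.

```python
from collections import Counter

def load_balance_stats(assignments, router_probs, num_experts):
    """assignments: chosen expert index per token (top-1 routing assumed).
    router_probs: one list of router probabilities per token.
    Returns (fraction of tokens per expert, auxiliary balance score).
    The score is 1.0 for perfectly uniform routing and grows with skew.
    """
    n_tokens = len(assignments)
    counts = Counter(assignments)
    f = [counts.get(i, 0) / n_tokens for i in range(num_experts)]
    p = [sum(row[i] for row in router_probs) / n_tokens
         for i in range(num_experts)]
    aux = num_experts * sum(fi * pi for fi, pi in zip(f, p))
    return f, aux

# Balanced toy batch: four tokens spread evenly over four experts.
balanced_f, balanced_aux = load_balance_stats(
    assignments=[0, 1, 2, 3],
    router_probs=[[0.25, 0.25, 0.25, 0.25]] * 4,
    num_experts=4,
)

# Skewed toy batch: every token goes to expert 0.
skewed_f, skewed_aux = load_balance_stats(
    assignments=[0, 0, 0, 0],
    router_probs=[[0.7, 0.1, 0.1, 0.1]] * 4,
    num_experts=4,
)

print(balanced_f, balanced_aux)
print(skewed_f, skewed_aux)
```

In the skewed batch, experts 1 to 3 receive no tokens and therefore no gradient signal, which is the under-training problem described above. Adding a penalty proportional to this score is one way to nudge the router back towards an even spread.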