Core concepts

Mixture of Experts

A model architecture with many "expert" sub-networks that activates only a few of them per token, aiming for large-model quality at per-token compute closer to that of a small model.

01 ——

In plain English

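Mixture of Experts (MoE) is a neural network design where the model has many specialised sub-networks ("experts"), but a small routing layer (the router, or gate) picks only a few of them to run on any given token. The outputs of the chosen experts are then combined, typically as a weighted sum using the router's scores. The result: you get the capacity of a giant model's parameter count while paying per-token compute closer to that of a much smaller one.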
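The routing step described above can be sketched in a few lines of plain Python. This is a toy illustration, not any particular model's implementation: the names `route_top_k`, `moe_layer` and `make_expert` are hypothetical, the "experts" are simple scaling functions, and normalising the router scores with a softmax over only the selected experts is one common choice among several.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k):
    """Return [(expert_index, weight), ...] for the k highest-scoring experts."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    top = ranked[:k]
    # Assumption: weights are a softmax over the selected logits only.
    weights = softmax([router_logits[i] for i in top])
    return list(zip(top, weights))


def make_expert(scale):
    """Toy expert: multiplies every feature by a constant (hypothetical)."""
    return lambda token: [scale * x for x in token]


def moe_layer(token, experts, router_weights, k=2):
    """Run only the k routed experts and mix their outputs."""
    # The router is a linear layer: one logit per expert.
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    chosen = route_top_k(logits, k)
    output = [0.0] * len(token)
    for index, weight in chosen:
        expert_out = experts[index](token)  # experts not chosen never run
        output = [o + weight * e for o, e in zip(output, expert_out)]
    return output, [index for index, _ in chosen]


experts = [make_expert(s) for s in (1.0, 2.0, 3.0, 4.0)]
router_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
token = [2.0, 1.0]

output, used = moe_layer(token, experts, router_weights, k=2)
print("experts used:", used)
print("output:", output)
```

With four experts and k=2, only two expert functions are called for this token; the other two cost nothing at inference time, which is where the compute saving comes from.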

Why MoE wins on inference economics:

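  • A "1 trillion parameter" MoE might activate only about 50B parameters per token, so its per-token compute is close to that of a 50B dense model
  • Training is expensive (every expert has to be trained), but at scale the recurring serving cost is usually what matters most
  • Higher quality per unit of inference compute than a dense model with the same number of active parameters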
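The gap between total and active parameters in the list above is simple arithmetic. The sketch below uses a hypothetical configuration (layer count, expert count and expert size are invented for illustration and do not describe any real model) chosen to land near the "1 trillion total, about 50B active" example. It also applies the common rule of thumb of roughly 2 FLOPs per active parameter per token for a forward pass.

```python
def moe_param_counts(shared_params, params_per_expert, num_experts,
                     experts_per_token, num_layers):
    """Return (total_params, active_params_per_token) for a simple MoE.

    shared_params covers everything every token uses (attention,
    embeddings, router); the rest sits in per-layer experts.
    """
    expert_total = num_layers * num_experts * params_per_expert
    expert_active = num_layers * experts_per_token * params_per_expert
    return shared_params + expert_total, shared_params + expert_active


# Hypothetical configuration, for illustration only.
total, active = moe_param_counts(
    shared_params=10_000_000_000,
    params_per_expert=130_000_000,
    num_experts=128,
    experts_per_token=5,
    num_layers=60,
)

# Rule of thumb: about 2 FLOPs per active parameter per token (forward pass).
moe_flops_per_token = 2 * active
dense_flops_per_token = 2 * total  # dense model of the same total size

print(f"total parameters:  {total / 1e9:.1f}B")
print(f"active parameters: {active / 1e9:.1f}B")
print(f"compute saving vs same-size dense model: "
      f"{dense_flops_per_token / moe_flops_per_token:.1f}x")
```

Under these assumed numbers the model holds about 1.01T parameters but touches only 49B per token, roughly a 20x reduction in per-token compute compared with a dense model of the same total size.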

Notable MoE models:

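  • Mixtral (Mistral): popularised open-weight MoE; Mixtral 8x7B routes each token to 2 of 8 experts per layer
  • DeepSeek-V3 / R1: large MoE (671B total parameters, about 37B active per token) that competes with frontier closed models
  • Qwen-MoE (Alibaba)
  • Grok (xAI): Grok-1 was released with open weights as a 314B-parameter MoE
  • GPT-4 is widely believed to be MoE (never confirmed by OpenAI)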

Trade-offs:

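  • More complex to train and fine-tune, because the router and the experts have to be trained together and kept stable
  • Memory-heavy at inference even though FLOPs are lower: every expert's weights must stay loaded, since any token may be routed to any expert (a 1T-parameter model at 16-bit precision needs about 2 TB for weights alone)
  • Routing imbalance can leave some experts under-trained if the router keeps favouring a few of them; training often adds a load-balancing mechanism to counter this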
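
The routing imbalance in the last point can be measured directly. The sketch below counts how many tokens each expert receives and computes an auxiliary load-balancing term in the style used by Switch Transformer: the number of experts times the sum, over experts, of (fraction of tokens routed to the expert) times (mean router probability for the expert). The function name `load_balance_stats` and the sample probabilities are hypothetical, and real systems use several variants of this idea.

```python
def load_balance_stats(router_probs, k=1):
    """Return (tokens_per_expert, aux_loss) for a batch of router outputs.

    router_probs: one list of per-expert probabilities for each token.
    The auxiliary term is about 1.0 when routing is uniform and grows
    as routing concentrates on a few experts.
    """
    num_tokens = len(router_probs)
    num_experts = len(router_probs[0])

    counts = [0] * num_experts
    for probs in router_probs:
        top = sorted(range(num_experts),
                     key=lambda i: probs[i], reverse=True)[:k]
        for i in top:
            counts[i] += 1

    # f_i: share of routing slots that went to expert i.
    frac_tokens = [c / (num_tokens * k) for c in counts]
    # P_i: average probability the router assigned to expert i.
    mean_prob = [sum(p[i] for p in router_probs) / num_tokens
                 for i in range(num_experts)]

    aux_loss = num_experts * sum(f * p for f, p in zip(frac_tokens, mean_prob))
    return counts, aux_loss


# Balanced batch: each token prefers a different expert.
balanced = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
    [0.1, 0.1, 0.1, 0.7],
]
# Skewed batch: every token prefers expert 0.
skewed = [[0.7, 0.1, 0.1, 0.1] for _ in range(4)]

balanced_counts, balanced_loss = load_balance_stats(balanced)
skewed_counts, skewed_loss = load_balance_stats(skewed)
print("balanced:", balanced_counts, round(balanced_loss, 3))
print("skewed:  ", skewed_counts, round(skewed_loss, 3))
```

In the skewed batch three of the four experts receive no tokens at all, so they would get no gradient signal from it; adding a term like this to the training loss penalises that pattern.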

02 ——

Related terms

Vol. 4 · Issue 19 · Last reviewed 2026-05-30
