Training

DPO

Direct Preference Optimization — a simpler alternative to RLHF that fine-tunes a model directly on preference pairs, no separate reward model required.

01 ——

In plain English

DPO (Direct Preference Optimization) is a training method for aligning a language model with human preferences. Like RLHF, it learns from comparisons ("response A is better than response B"). Unlike RLHF, it skips the intermediate step of training a reward model and instead optimises the language model directly.
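To make "optimises the language model directly" concrete, here is a minimal sketch of the DPO loss for a single preference pair, following the standard published formulation. The function name, argument names, and the default `beta=0.1` are illustrative assumptions, and the inputs are assumed to be summed token log-probabilities of each response under the model being trained (the policy) and under a frozen reference copy.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair. Names and beta are illustrative."""
    # How much more likely each response is under the policy than under the reference.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # Scaled margin: positive when the policy prefers the chosen response
    # more strongly than the reference model does.
    margin = beta * (chosen_logratio - rejected_logratio)

    # Loss is -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Policy identical to the reference: margin is 0 and the loss is log(2).
baseline = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# Policy has shifted probability toward the chosen response: loss drops.
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)
```

No reward model appears anywhere: the log-ratio against the reference plays that role, and the reference term keeps the policy from drifting far from its starting point.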

Why teams use DPO:

  • Simpler — fewer moving parts than full RLHF
  • Cheaper — no separate reward-model training pass
  • More stable — no RL loop (such as PPO) and no learned reward model for the policy to exploit, though DPO can still overfit the preference data
  • Reproducible — easier to debug and re-run

Limits:

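  • Less expressive than RLHF for complex multi-step preferences, since it learns only from a fixed set of pairs rather than from fresh samples of the current model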
  • Needs clean preference data (noisy pairs hurt it more than RLHF)
  • Usually run after SFT (supervised fine-tuning) rather than as a replacement for it
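Because noisy pairs are costly, a basic consistency check on the preference data is a common first step. The sketch below shows one possible check, assuming each pair is stored as a `(prompt, chosen, rejected)` tuple: it drops pairs whose two responses are identical and pairs that appear with both orderings for the same prompt. The tuple layout and the function name `drop_inconsistent_pairs` are hypothetical, not part of any particular library.

```python
def drop_inconsistent_pairs(pairs):
    """Remove degenerate and contradictory preference pairs.

    pairs: list of (prompt, chosen, rejected) tuples (hypothetical layout).
    """
    pairs = list(pairs)
    seen = set(pairs)

    cleaned = []
    for prompt, chosen, rejected in pairs:
        # A pair with identical responses carries no preference signal.
        if chosen == rejected:
            continue
        # The same two responses labelled the opposite way is a contradiction;
        # drop both copies rather than guess which label is right.
        if (prompt, rejected, chosen) in seen:
            continue
        cleaned.append((prompt, chosen, rejected))
    return cleaned

raw = [
    ("Explain DPO", "answer A", "answer B"),
    ("Explain DPO", "answer B", "answer A"),   # contradicts the line above
    ("Define SFT", "answer C", "answer C"),    # degenerate
    ("Define RLHF", "answer D", "answer E"),
]
clean = drop_inconsistent_pairs(raw)
```

Real pipelines often go further (annotator agreement, length or quality filters); this only catches the most mechanical kinds of noise.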

Variants: KTO (Kahneman-Tversky Optimization), IPO (Identity Preference Optimization), ORPO (Odds Ratio Preference Optimization), and SimPO (Simple Preference Optimization) all descend from DPO. Many open-source instruction-tuned models (Llama-Instruct, Qwen-Instruct) use DPO or a variant for the final alignment pass.


Vol. 4 · Issue 19 · Last reviewed 2026-05-30
