DPO
Direct Preference Optimization: a simpler alternative to RLHF that fine-tunes a model directly on preference pairs, with no separate reward model required.
In plain English
DPO (Direct Preference Optimization) is a training method for aligning a language model with human preferences. Like RLHF, it learns from comparisons ("response A is better than response B"). Unlike RLHF, it skips the intermediate step of training a reward model. Instead, it optimises the language model directly with a classification-style loss on the preference pairs, using a frozen copy of the starting model as a reference.
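The idea can be made concrete with a minimal sketch of the per-pair DPO loss in plain Python. It assumes you already have the summed log-probability of each response under the model being trained (the "policy") and under a frozen reference copy. The function and parameter names, and the default `beta=0.1`, are illustrative choices, not taken from any particular library.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # illustrative default; strength of the pull toward the reference
) -> float:
    """DPO loss for one (chosen, rejected) pair of responses to the same prompt."""
    # How much more (or less) likely each response is under the policy
    # than under the frozen reference model, in log space.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # The loss falls as the policy raises the chosen response's log-ratio
    # relative to the rejected response's log-ratio.
    margin = beta * (chosen_logratio - rejected_logratio)
    return -log_sigmoid(margin)


# At the start of training the policy equals the reference, so the margin is 0
# and the loss is log(2), about 0.6931.
print(dpo_loss(-12.0, -15.0, -12.0, -15.0))
```

In a real training run these log-probabilities would come from forward passes of the two models over a batch, and the loss would be averaged and backpropagated through the policy only.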
Why teams use DPO:
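- Simpler: fewer moving parts than full RLHF
- Cheaper: no separate reward-model training pass
- More stable: with no RL loop (such as PPO) and no learned reward model to exploit, training tends to behave like ordinary supervised fine-tuning, though it can still overfit the preference data
- Reproducible: a fixed dataset and a single loss make runs easier to debug and re-run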
Limits:
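- Less flexible than RLHF: standard DPO learns only from a fixed set of pairwise comparisons, so it can struggle with complex, multi-step preferences
- Needs clean preference data (noisy or mislabeled pairs feed straight into the loss and can hurt it more than they hurt RLHF)
- Rarely used alone: DPO is typically applied on top of a model that has already been through SFT (supervised fine-tuning)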
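As a rough illustration of what "clean" can mean for preference data, the sketch below applies three basic checks before training: it drops pairs with an empty response, pairs whose two responses are identical, and pairs that are contradicted elsewhere in the dataset (the same prompt with the preference flipped). The `PreferencePair` type and `clean_pairs` function are hypothetical names for this example. Real pipelines usually add further filters that are out of scope here.

```python
from typing import NamedTuple


class PreferencePair(NamedTuple):
    prompt: str
    chosen: str    # the response labeled as better
    rejected: str  # the response labeled as worse


def clean_pairs(pairs: list[PreferencePair]) -> list[PreferencePair]:
    """Return pairs that pass basic sanity checks, preserving input order."""
    # Every (prompt, chosen, rejected) ordering present in the raw data.
    orderings = {(p.prompt, p.chosen, p.rejected) for p in pairs}

    kept: list[PreferencePair] = []
    emitted: set[tuple[str, str, str]] = set()
    for p in pairs:
        key = (p.prompt, p.chosen, p.rejected)
        flipped = (p.prompt, p.rejected, p.chosen)

        if not p.chosen.strip() or not p.rejected.strip():
            continue  # one side is empty, so there is nothing to compare
        if p.chosen == p.rejected:
            continue  # identical responses carry no preference signal
        if flipped in orderings:
            continue  # labels disagree for this pair, so drop both directions
        if key in emitted:
            continue  # exact duplicate of a pair already kept
        emitted.add(key)
        kept.append(p)
    return kept
```

Dropping both directions of a contradictory pair is a conservative choice. An alternative would be to keep the majority label when a comparison appears many times.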
Variants: KTO (Kahneman-Tversky Optimization), IPO (Identity Preference Optimization), ORPO (Odds Ratio Preference Optimization), and SimPO (Simple Preference Optimization) all build on DPO. Many open-source instruction-tuned models (Llama-Instruct, Qwen-Instruct) use DPO or a variant for the final alignment pass.
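To show how a variant can differ from DPO, here is a hedged sketch of the per-pair losses for two of them, following the formulas in their respective papers as best understood. IPO replaces DPO's log-sigmoid with a squared error that pulls the log-ratio margin toward a fixed target. SimPO drops the reference model, normalizes each log-probability by response length, and requires a minimum margin. The function names and default hyperparameters (`tau`, `beta`, `gamma`) are illustrative assumptions, not recommended settings.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def ipo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    tau: float = 0.1,  # illustrative; smaller tau means a larger target margin
) -> float:
    """IPO: squared distance between the log-ratio margin and 1 / (2 * tau)."""
    margin = (policy_chosen_logp - ref_chosen_logp) - (
        policy_rejected_logp - ref_rejected_logp
    )
    return (margin - 1.0 / (2.0 * tau)) ** 2


def simpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    chosen_len: int,    # number of tokens in the chosen response
    rejected_len: int,  # number of tokens in the rejected response
    beta: float = 2.0,   # illustrative scale on the length-normalized log-probs
    gamma: float = 1.0,  # illustrative target margin between chosen and rejected
) -> float:
    """SimPO: reference-free loss on average per-token log-probabilities."""
    chosen_score = beta * policy_chosen_logp / chosen_len
    rejected_score = beta * policy_rejected_logp / rejected_len
    return -log_sigmoid(chosen_score - rejected_score - gamma)
```

Unlike the DPO loss, the IPO loss rises again once the margin passes its target, which is meant to limit overfitting to the pairs. SimPO needs no reference-model forward pass, which saves memory and compute.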