Training

DPO

Direct Preference Optimization — a simpler alternative to RLHF that fine-tunes a model directly on preference pairs, no separate reward model required.

01 ——

In plain English

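DPO (Direct Preference Optimization) is a training method for aligning a language model with human preferences. Like RLHF, it learns from comparisons ("response A is better than response B"). Unlike RLHF, it skips the intermediate step of training a reward model and instead optimises the language model directly: a classification-style loss raises the likelihood of the preferred response and lowers that of the rejected one, both measured relative to a frozen reference copy of the model.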
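
As a rough illustration of what optimising the language model directly looks like, the per-pair DPO loss can be sketched in a few lines of plain Python. The function and argument names are illustrative assumptions, not any library's API. The inputs are the summed token log-probabilities of each response under the model being trained (the policy) and under a frozen reference model, and `beta=0.1` is only a commonly used starting value.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value; controls how far the policy may drift
) -> float:
    """Per-pair DPO loss from sequence log-probabilities."""
    # How much more (or less) likely each response is under the policy
    # than under the frozen reference model.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # The loss is small when the chosen response has gained more
    # probability (relative to the reference) than the rejected one.
    margin = beta * (chosen_logratio - rejected_logratio)
    return -log_sigmoid(margin)


# Policy identical to the reference: the loss equals log(2).
baseline = dpo_loss(-10.0, -10.0, -10.0, -10.0)

# Policy now prefers the chosen response: the loss drops below log(2).
improved = dpo_loss(-8.0, -12.0, -10.0, -10.0)
```

In a real training run the same expression is averaged over a batch and differentiated with respect to the policy's parameters; the reference model's log-probabilities are treated as constants.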

Why teams use DPO:

Simpler — fewer moving parts than full RLHF
Cheaper — no separate reward-model training pass
More stable — no RL sampling loop (such as PPO) to tune and no learned reward model for the policy to exploit, although DPO can still overfit its preference data
Reproducible — easier to debug and re-run

Limits:

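Less flexible than RLHF: it learns only from a fixed, offline set of pairs, which can hold it back on complex multi-step preferences
Needs clean preference data (mislabelled pairs go straight into the loss, so noise tends to hurt it more than RLHF)
Usually needs an SFT (supervised fine-tuning) pass first; DPO is typically applied on top of an SFT model rather than in place of it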
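
Because DPO trains directly on the pairs, a basic sanity filter before training is often worthwhile. The sketch below assumes each example is stored as a prompt plus a chosen and a rejected response, which is a common convention; the `PreferencePair` class and `filter_pairs` helper are hypothetical names invented for this example. It removes only the most obvious problems and is not a substitute for proper label review.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferencePair:
    """One preference example (hypothetical schema for illustration)."""
    prompt: str
    chosen: str
    rejected: str


def filter_pairs(pairs: list[PreferencePair]) -> list[PreferencePair]:
    """Drop empty, tied, contradictory, and duplicate preference pairs."""
    # Every (prompt, chosen, rejected) triple in the dataset, used to
    # spot the same two responses labelled in opposite directions.
    triples = {(p.prompt, p.chosen.strip(), p.rejected.strip()) for p in pairs}

    kept: list[PreferencePair] = []
    emitted: set[tuple[str, str, str]] = set()
    for p in pairs:
        chosen, rejected = p.chosen.strip(), p.rejected.strip()
        key = (p.prompt, chosen, rejected)
        if not chosen or not rejected:
            continue  # an empty response carries no usable signal
        if chosen == rejected:
            continue  # a tie: there is nothing to prefer
        if (p.prompt, rejected, chosen) in triples:
            continue  # contradictory labels: drop both directions
        if key in emitted:
            continue  # exact duplicate of a pair already kept
        emitted.add(key)
        kept.append(p)
    return kept


raw = [
    PreferencePair("q1", "good answer", "bad answer"),
    PreferencePair("q1", "same", "same"),
    PreferencePair("q2", "x", "y"),
    PreferencePair("q2", "y", "x"),
    PreferencePair("q1", "good answer", "bad answer"),
    PreferencePair("q3", "", "z"),
]
clean = filter_pairs(raw)
```

Dropping both sides of a contradictory pair is a deliberately conservative choice; a team with annotator metadata might instead keep the majority label.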

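Variants: KTO (Kahneman-Tversky Optimization), IPO (Identity Preference Optimization), ORPO (Odds Ratio Preference Optimization), and SimPO (Simple Preference Optimization) are all DPO descendants or close relatives; they change the loss, the data format (KTO learns from unpaired good/bad labels), or the need for a reference model (ORPO and SimPO drop it). Many open-source instruction-tuned models (Llama-Instruct, Qwen-Instruct) use DPO or a variant for the final alignment pass.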
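
To make the family resemblance concrete, here is a hedged sketch of how two of these variants change the per-pair loss, following the formulations in their papers as best understood here. IPO replaces DPO's log-sigmoid with a squared error toward a fixed target margin, and SimPO drops the reference model in favour of length-normalised log-probabilities plus a target margin. Function names, argument names, and default hyperparameters are illustrative assumptions, not any library's API.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def ipo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value
) -> float:
    """IPO-style loss: push the log-ratio gap toward 1 / (2 * beta)."""
    gap = (policy_chosen_logp - ref_chosen_logp) - (
        policy_rejected_logp - ref_rejected_logp
    )
    # Squared error instead of log-sigmoid, so the gap is pulled toward
    # a finite target rather than rewarded for growing without bound.
    return (gap - 1.0 / (2.0 * beta)) ** 2


def simpo_loss(
    policy_chosen_logp: float,
    chosen_num_tokens: int,
    policy_rejected_logp: float,
    rejected_num_tokens: int,
    beta: float = 2.0,   # assumed value
    gamma: float = 1.0,  # assumed target margin
) -> float:
    """SimPO-style loss: no reference model, length-normalised scores."""
    # Average per-token log-probability removes the bias toward
    # shorter (or longer) responses.
    chosen_score = beta * policy_chosen_logp / chosen_num_tokens
    rejected_score = beta * policy_rejected_logp / rejected_num_tokens
    return -log_sigmoid(chosen_score - rejected_score - gamma)


# IPO: a gap of exactly 1 / (2 * beta) = 5 gives zero loss.
ipo_at_target = ipo_loss(-5.0, -10.0, -10.0, -10.0, beta=0.1)

# SimPO: equal per-token scores leave only the margin term.
simpo_tied = simpo_loss(-10.0, 5, -20.0, 10, beta=2.0, gamma=1.0)
```

Both functions take the same kind of summed log-probabilities as a DPO loss would, which is part of why these methods are often treated as drop-in alternatives in training code.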

02 ——