RLAIF
Reinforcement Learning from AI Feedback — alignment training where another AI model, not a human, provides the preference signal used to fine-tune the target model.
In plain English
RLAIF replaces the human-rater step of RLHF (Reinforcement Learning from Human Feedback) with a strong AI judge. Instead of paying humans to compare responses, the lab uses a separate model to judge which of two responses better fits the principles the trained model should learn.
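As a rough illustration of the judging step, here is a minimal Python sketch of turning two candidate responses into a preference pair. Everything named here is hypothetical: `PreferencePair`, `label_pair`, and the judge signature are invented for the example, and `toy_judge` is a stand-in for what would in practice be a call to a strong model. Asking the judge twice with the order swapped is one common way to guard against order sensitivity; it is an assumption of this sketch, not something every RLAIF setup does.

```python
from typing import Callable, NamedTuple, Optional


class PreferencePair(NamedTuple):
    prompt: str
    chosen: str
    rejected: str


# Hypothetical judge signature: (principle, prompt, response_a, response_b) -> "A" or "B".
Judge = Callable[[str, str, str, str], str]


def label_pair(judge: Judge, principle: str, prompt: str,
               resp_a: str, resp_b: str) -> Optional[PreferencePair]:
    """Ask the judge in both orders and keep the pair only if the verdicts agree."""
    first = judge(principle, prompt, resp_a, resp_b)
    # In the swapped call, "A" refers to resp_b and "B" to resp_a.
    second = judge(principle, prompt, resp_b, resp_a)
    if first == "A" and second == "B":
        return PreferencePair(prompt, chosen=resp_a, rejected=resp_b)
    if first == "B" and second == "A":
        return PreferencePair(prompt, chosen=resp_b, rejected=resp_a)
    return None  # Verdict depended on position, so the pair is discarded.


def toy_judge(principle: str, prompt: str, a: str, b: str) -> str:
    """Stand-in judge that prefers the shorter response; a real judge would be a model call."""
    return "A" if len(a) <= len(b) else "B"


pair = label_pair(toy_judge, "Be concise.", "What is RLAIF?",
                  "RL from AI feedback.",
                  "It is a long story that starts many years ago.")
```

The resulting pairs would then feed whatever preference-training step the lab uses, in the same place human-labelled comparisons would go in RLHF.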
Why labs use RLAIF:
- Scale — an AI judge can produce millions of comparisons cheaply
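- Consistency — one judge with a fixed prompt applies the same rubric to every comparison, so there is no disagreement between raters (the judge can still be noisy or sensitive to response order)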
- Speed — overnight runs instead of multi-week human campaigns
- Constitutional pairing — pair RLAIF with a written constitution so the judge's criteria are explicit and the labelling can be rerun
Trade-offs vs RLHF:
- The trained model inherits the judge's biases — garbage in, garbage out
- Blind spots can be catastrophic if the judge is itself weak at the criterion it is scoring
- Still typically combined with human feedback at the most important checkpoints
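One way to look for judge blind spots is to compare the judge's verdicts with human verdicts on a small audited sample, broken down by criterion. The sketch below assumes a hypothetical record format of `(criterion, judge_choice, human_choice)` and a hypothetical helper `agreement_by_criterion`; the sample data is made up purely to show the shape of the calculation.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical audit record: (criterion, judge_choice, human_choice).
AuditRecord = Tuple[str, str, str]


def agreement_by_criterion(records: Iterable[AuditRecord]) -> Dict[str, float]:
    """Return, per criterion, the fraction of comparisons where judge and human agree."""
    matches: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for criterion, judge_choice, human_choice in records:
        totals[criterion] += 1
        if judge_choice == human_choice:
            matches[criterion] += 1
    return {criterion: matches[criterion] / totals[criterion] for criterion in totals}


# Made-up audit sample for illustration only.
sample = [
    ("harmlessness", "A", "A"),
    ("harmlessness", "B", "B"),
    ("math", "A", "B"),
    ("math", "B", "B"),
]
rates = agreement_by_criterion(sample)
```

A criterion with low agreement is a candidate for keeping human feedback in the loop rather than relying on the judge alone.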
Where it sits in the pipeline: Many modern post-training stacks use a mix: SFT, then human-preference training (RLHF or DPO) for the highest-stakes behaviours, then RLAIF to scale alignment across the long tail of edge cases.