Training

RLAIF

Reinforcement Learning from AI Feedback — alignment training where another AI model, not a human, provides the preference signal used to fine-tune the target model.

01 ——

In plain English

RLAIF (Reinforcement Learning from AI Feedback) replaces the human-rater step of RLHF with a strong AI judge. Instead of paying humans to compare responses, a lab uses a separate model to judge which of two responses better fits the principles the trained model should learn.
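The labeling step described above can be sketched in a few lines of Python. This is a minimal illustration, not any lab's actual pipeline: `PreferencePair`, `label_pair`, and `toy_judge` are hypothetical names, and `toy_judge` is a hard-coded heuristic standing in for what would really be a call to a judge model.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PreferencePair:
    """One AI-labeled comparison, in the shape preference training usually expects."""
    prompt: str
    chosen: str
    rejected: str


def toy_judge(prompt: str, response: str) -> float:
    """Hypothetical stand-in for an AI judge. A real judge would query a model;
    this heuristic only exists so the sketch runs end to end."""
    score = 0.0
    if "sorry" not in response.lower():
        score += 1.0  # reward responses that do not refuse
    score += min(len(response.split()), 20) / 20.0  # mild reward for detail
    return score


def label_pair(
    prompt: str,
    response_a: str,
    response_b: str,
    judge: Callable[[str, str], float],
) -> PreferencePair:
    """Score both responses with the judge and record which one it prefers."""
    score_a = judge(prompt, response_a)
    score_b = judge(prompt, response_b)
    if score_a >= score_b:
        return PreferencePair(prompt, chosen=response_a, rejected=response_b)
    return PreferencePair(prompt, chosen=response_b, rejected=response_a)


pair = label_pair(
    "How do I reverse a list in Python?",
    "Sorry, I can't help.",
    "Use reversed(xs) or xs[::-1] to get the elements in reverse order.",
    toy_judge,
)
print(pair.chosen)
```

Passing the judge in as a function keeps the labeling logic independent of whichever model supplies the scores, so the same loop works whether the preference comes from a heuristic, a model, or a human.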

Why labs use RLAIF:

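Scale: an AI judge can produce millions of comparisons cheaply
Consistency: no disagreement between human raters, although the judge's own verdicts can still shift with prompt wording or the order in which responses are shown
Speed: overnight runs instead of multi-week human campaigns
Constitutional pairing: combine RLAIF with a written constitution (an explicit list of principles) so the judging criteria are documented and repeatable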
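One way to picture the constitutional pairing is as a fixed list of principles rendered into every judge prompt. The sketch below assumes a simple A/B comparison format; the two principles in `CONSTITUTION` and the prompt wording in `build_judge_prompt` are illustrative placeholders, not text from any published constitution.

```python
from typing import Sequence

# Illustrative principles only; a real constitution would be longer and more specific.
CONSTITUTION = [
    "Prefer the response that is more helpful to the user.",
    "Prefer the response that avoids harmful or deceptive content.",
]


def build_judge_prompt(
    constitution: Sequence[str],
    user_prompt: str,
    response_a: str,
    response_b: str,
) -> str:
    """Render the principles and both candidate responses into one judge prompt."""
    principles = "\n".join(
        f"{i}. {text}" for i, text in enumerate(constitution, start=1)
    )
    return (
        "Judge the two responses using these principles:\n"
        f"{principles}\n\n"
        f"User prompt: {user_prompt}\n"
        f"Response A: {response_a}\n"
        f"Response B: {response_b}\n"
        "Answer with a single letter, A or B."
    )


judge_prompt = build_judge_prompt(
    CONSTITUTION,
    "Explain what a hash table is.",
    "A hash table maps keys to values using a hash function.",
    "I don't know.",
)
print(judge_prompt)
```

Because the criteria live in data rather than in a rater's head, the same inputs always produce the same prompt, and changing a principle leaves a reviewable diff.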

Trade-offs vs RLHF:

The trained model inherits the AI judge's biases: garbage in, garbage out
If the judge is weak at the criterion being scored, its mistakes flow into training unnoticed and can leave serious blind spots
Still typically combined with human feedback at the most important checkpoints
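A simple way to combine the two feedback sources is to spot-check the judge against a small human-labeled sample before trusting it at scale. The sketch below is an assumption about how such a check might look: `agreement_rate` is a hypothetical helper, and the labels are made-up illustration data, not measured results.

```python
from typing import Sequence


def agreement_rate(judge_labels: Sequence[str], human_labels: Sequence[str]) -> float:
    """Fraction of comparisons where the AI judge picked the same winner as the human."""
    if len(judge_labels) != len(human_labels):
        raise ValueError("label lists must have the same length")
    if not judge_labels:
        raise ValueError("need at least one labeled comparison")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)


# Made-up labels for five comparisons: which response ("A" or "B") each rater preferred.
judge_labels = ["A", "B", "A", "A", "B"]
human_labels = ["A", "B", "B", "A", "B"]

rate = agreement_rate(judge_labels, human_labels)
print(f"judge/human agreement: {rate:.0%}")
```

A low agreement rate on a given criterion is a hint that the judge may be weak there, which is where human feedback is most worth its cost.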

Where it sits in the pipeline: Many modern post-training stacks mix methods, starting with supervised fine-tuning (SFT), then human-preference training (RLHF or DPO) for the highest-stakes behaviours, then RLAIF to scale alignment across the long tail of edge cases.

02 ——