RLHF
Reinforcement Learning from Human Feedback — the training technique that teaches AI models to be helpful, harmless, and honest.
In plain English
RLHF (Reinforcement Learning from Human Feedback) is a fine-tuning process that aligns a language model's outputs with what humans actually find helpful and safe. It's a major reason ChatGPT, Claude, and similar assistants feel more useful and less erratic than raw pre-trained models.
How it works (simplified):
- The base model generates several responses to the same prompt
- Human raters compare those responses and rank them from best to worst
- A "reward model" is trained on those rankings to predict which responses humans prefer
- The base model is further trained with reinforcement learning to maximise the reward model's score
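To make the middle steps concrete, here is a minimal sketch in plain Python of how a reward model can learn from human comparisons. Everything in it is an assumption for illustration: the two made-up features (helpfulness, rudeness), the toy preference pairs, and the function names are hypothetical, and a real reward model is a neural network scoring text, not a two-weight linear model. The loss used is the standard pairwise (Bradley-Terry style) preference loss. The last lines simply pick the highest-scoring candidate, which is a stand-in for the final step and not actual reinforcement learning.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def reward(weights, features):
    """Toy reward model: a weighted sum of hand-made response features."""
    return sum(w * f for w, f in zip(weights, features))


def pairwise_loss(weights, preferred, rejected):
    """Low when the preferred response scores higher than the rejected one."""
    margin = reward(weights, preferred) - reward(weights, rejected)
    return -math.log(sigmoid(margin))


def train_reward_model(pairs, n_features, lr=0.1, epochs=200):
    """Fit the weights by gradient descent on the pairwise loss."""
    weights = [0.0] * n_features
    for _ in range(epochs):
        for preferred, rejected in pairs:
            margin = reward(weights, preferred) - reward(weights, rejected)
            # d/d(margin) of -log(sigmoid(margin)) is -(1 - sigmoid(margin))
            grad_scale = -(1.0 - sigmoid(margin))
            for i in range(n_features):
                weights[i] -= lr * grad_scale * (preferred[i] - rejected[i])
    return weights


# Hypothetical rater data: (preferred response, rejected response),
# each response described as [helpfulness, rudeness].
pairs = [
    ([0.9, 0.1], [0.2, 0.8]),
    ([0.7, 0.0], [0.6, 0.9]),
    ([0.8, 0.2], [0.1, 0.1]),
]
weights = train_reward_model(pairs, n_features=2)

# Stand-in for the last step: score new candidates and keep the best one.
candidates = {
    "polite and useful": [0.9, 0.0],
    "rude and vague": [0.1, 0.9],
}
best = max(candidates, key=lambda name: reward(weights, candidates[name]))
print(weights, best)
```

In this toy setup the learned weight on helpfulness ends up positive and the weight on rudeness negative, so the model has picked up the raters' preferences without being told the rule directly. In real RLHF, the assistant model itself is then updated so that it produces high-reward responses in the first place.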
Why it matters: Without RLHF, LLMs often produce text that is fluent and plausible but unhelpful, offensive, or unsafe. RLHF is a key step in turning a raw language model into a usable assistant.