Training

RLHF

Reinforcement Learning from Human Feedback: a training technique that teaches AI models to be helpful, harmless, and honest.

01 ——

In plain English

RLHF (Reinforcement Learning from Human Feedback) is a process for aligning a language model's outputs with what humans actually find helpful and safe. It is a major reason why ChatGPT, Claude, and similar assistants feel more useful and less erratic than raw pre-trained models.

How it works (simplified):

  1. The base model generates many responses to the same prompt
  2. Human raters rank those responses from best to worst
  3. A separate "reward model" is trained on those rankings to predict which responses humans prefer
  4. The base model is further trained with reinforcement learning to maximise the reward model's score
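The four steps above can be sketched in a few lines of Python. This is a toy illustration under loud assumptions, not how production systems are built: the "model" is just a probability table over four canned responses, the human raters are simulated by a hidden score list, and names such as RESPONSES, HIDDEN_HUMAN_SCORE, and run_rlhf_sketch are hypothetical. The reward model here is fitted with a simple pairwise preference loss, which is one common choice rather than the only one.

```python
import math
import random

# Hypothetical toy data: one prompt with four canned candidate responses.
RESPONSES = ["helpful answer", "vague answer", "rude answer", "off-topic answer"]
# Assumption for this sketch: a hidden score list stands in for human raters.
HIDDEN_HUMAN_SCORE = [3.0, 1.0, -2.0, 0.0]


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def collect_preferences(policy_logits, n_pairs, rng):
    """Steps 1-2: sample response pairs and let the simulated rater pick a winner."""
    probs = softmax(policy_logits)
    pairs = []
    for _ in range(n_pairs):
        a, b = rng.choices(range(len(RESPONSES)), weights=probs, k=2)
        if a == b:
            continue  # identical responses carry no preference signal
        if HIDDEN_HUMAN_SCORE[a] > HIDDEN_HUMAN_SCORE[b]:
            pairs.append((a, b))
        else:
            pairs.append((b, a))
    return pairs  # list of (winner, loser) index pairs


def train_reward_model(pairs, epochs=200, lr=0.1):
    """Step 3: fit one reward per response so that winners score above losers."""
    rewards = [0.0] * len(RESPONSES)
    for _ in range(epochs):
        for winner, loser in pairs:
            # Probability the reward model currently assigns to the human's choice.
            p_correct = 1.0 / (1.0 + math.exp(-(rewards[winner] - rewards[loser])))
            # Gradient step on the loss -log(p_correct).
            step = lr * (1.0 - p_correct)
            rewards[winner] += step
            rewards[loser] -= step
    return rewards


def optimise_policy(policy_logits, rewards, steps=100, lr=0.5):
    """Step 4: gradient ascent on the expected reward under the policy."""
    logits = list(policy_logits)
    for _ in range(steps):
        probs = softmax(logits)
        expected = sum(p * r for p, r in zip(probs, rewards))
        # Exact gradient of the expected reward for a softmax policy.
        logits = [x + lr * p * (r - expected)
                  for x, p, r in zip(logits, probs, rewards)]
    return logits


def run_rlhf_sketch(seed=0):
    rng = random.Random(seed)
    base_logits = [0.0] * len(RESPONSES)  # toy "base model": all responses equally likely
    pairs = collect_preferences(base_logits, n_pairs=200, rng=rng)
    rewards = train_reward_model(pairs)
    tuned_logits = optimise_policy(base_logits, rewards)
    return softmax(base_logits), rewards, softmax(tuned_logits)


if __name__ == "__main__":
    before, rewards, after = run_rlhf_sketch()
    for text, p0, r, p1 in zip(RESPONSES, before, rewards, after):
        print(f"{text:18s} before={p0:.2f} reward={r:+.2f} after={p1:.2f}")
```

Running the script prints, for each response, its probability before tuning, its learned reward, and its probability after tuning. Real systems replace the probability table with a large neural network and usually add a penalty that keeps the tuned model close to its starting point; that penalty is left out here for brevity.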

Why it matters: Without RLHF, LLMs often produce text that sounds plausible but is unhelpful, offensive, or unsafe. RLHF is a key part of what turns a raw language model into a usable assistant.


Vol. 4 · Issue 19 · Last reviewed 2026-05-30
