Safety

Alignment

The challenge of making AI systems behave in ways that match human values and intentions — not just their literal instructions.

01 ——

In plain English

Alignment is the field of research focused on making sure AI systems do what humans actually intend, not merely what they were literally told to do. It's harder than it sounds: a model trained to maximise user engagement, for example, might learn to be addictive rather than helpful.
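
The engagement example can be made concrete with a toy sketch. Everything here is hypothetical: the `Item` type, the three candidates, and their scores are made-up numbers chosen only to show how optimising a proxy (engagement) can select something different from what users want (helpfulness).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    name: str
    engagement: float   # the proxy the system is told to maximise (made-up value)
    helpfulness: float  # what users actually want (made-up value)


# Hypothetical candidates; the scores are illustrative, not measured.
CANDIDATES = [
    Item("clear answer, then stop", engagement=0.3, helpfulness=0.9),
    Item("endless autoplay feed", engagement=0.9, helpfulness=0.2),
    Item("useful weekly digest", engagement=0.5, helpfulness=0.7),
]


def pick(items, objective):
    """Return the item that scores highest under the given objective."""
    return max(items, key=objective)


by_proxy = pick(CANDIDATES, lambda item: item.engagement)
by_intent = pick(CANDIDATES, lambda item: item.helpfulness)

print("Optimising engagement picks: ", by_proxy.name)
print("Optimising helpfulness picks:", by_intent.name)
```

Both calls satisfy their objective perfectly. The gap between the two results is the misalignment: the optimiser did nothing wrong by its own measure, it was simply pointed at the wrong target.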

Why misalignment is dangerous:

  • A model optimised for the wrong objective can produce harmful outputs while technically satisfying its goal
  • As AI systems become more capable and are given more autonomy, small misalignments can compound into larger harms
  • A model could learn to behave well during training and evaluation but differently once deployed (often called deceptive alignment)

How labs try to align models:

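  • RLHF: reinforcement learning from human feedback, in which people compare model outputs and the model is trained towards the ones they prefer
  • Constitutional AI: training models to follow a written set of principles, using the model's own critiques of its outputs as feedback
  • Red teaming: actively trying to make the model misbehave to find failure modes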
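
As a rough illustration of the RLHF item, human comparisons are commonly turned into a training signal with a pairwise preference loss on a reward model. The sketch below assumes the standard Bradley-Terry form, loss = -log(sigmoid(r_chosen - r_rejected)). The function name and the example reward values are hypothetical, and real systems compute this over batches with a neural network rather than on plain floats.

```python
import math


def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise preference loss: -log(sigmoid(reward_chosen - reward_rejected)).

    The loss is small when the reward model scores the human-preferred
    output higher than the rejected one, and large when it does the opposite.
    """
    margin = reward_chosen - reward_rejected
    # Numerically stable form of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical reward scores for one human comparison.
agrees_with_human = preference_loss(reward_chosen=2.0, reward_rejected=0.0)
disagrees_with_human = preference_loss(reward_chosen=0.0, reward_rejected=2.0)

print(f"reward model agrees with the rater:    {agrees_with_human:.3f}")
print(f"reward model disagrees with the rater: {disagrees_with_human:.3f}")
```

Minimising this loss pushes the reward model to rank outputs the way the human raters did. The trained reward model then stands in for human judgement during the reinforcement learning step.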

Alignment is one of the central research problems at frontier AI labs.

Vol. 4 · Issue 19 · Last reviewed 2026-05-30
