Safety

Alignment

The challenge of making AI systems behave in ways that match human values and intentions — not just their literal instructions.

01 ——

In plain English

Alignment is the field of research focused on making sure AI systems do what humans actually want, not merely what they were literally told to do. It's harder than it sounds: a model trained to maximise user engagement, for example, might learn to be addictive rather than helpful.
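The engagement example can be made concrete with a toy sketch. Everything here is hypothetical: the `Reply` class, the three candidate replies, and their scores are invented for illustration, and real systems do not expose a clean "helpfulness" number. The point is only that optimising a proxy signal and optimising the intended goal can select different behaviour.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reply:
    text: str
    engagement: float   # proxy signal the system is optimised for (made-up score)
    helpfulness: float  # what people actually want (made-up score)


# Hypothetical candidate replies with invented scores.
candidates = [
    Reply("Clear answer that solves the problem", engagement=0.4, helpfulness=0.9),
    Reply("Cliffhanger that invites endless follow-ups", engagement=0.9, helpfulness=0.2),
    Reply("Polite refusal", engagement=0.1, helpfulness=0.1),
]

# Same candidates, two different objectives.
chosen_by_proxy = max(candidates, key=lambda r: r.engagement)
chosen_by_intent = max(candidates, key=lambda r: r.helpfulness)

print("Optimising engagement picks: ", chosen_by_proxy.text)
print("Optimising helpfulness picks:", chosen_by_intent.text)
```

The proxy objective is fully satisfied by the cliffhanger reply, even though it is the less helpful one. That gap between the measured objective and the intended one is the core of the alignment problem.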

Why misalignment is dangerous:

A model optimised for the wrong objective can produce harmful outputs while technically satisfying its goal
As AI systems become more capable, the consequences of even small misalignments grow
Models could learn to behave well during testing but differently once deployed (often called deceptive alignment)

How labs try to align models:

RLHF: reinforcement learning from human feedback, which fine-tunes a model using human preference judgements
Constitutional AI: training models to follow a written set of principles
Red teaming: actively trying to make the model misbehave to find failure modes
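To give a feel for how human feedback becomes a training signal in RLHF, here is a minimal sketch of one commonly used formulation: a reward model scores two responses to the same prompt, and a pairwise loss penalises it when the response humans preferred gets the lower score. The function name `preference_loss` and the example scores are assumptions made for illustration; this is not a description of any particular lab's pipeline.

```python
import math


def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise preference loss: -log(sigmoid(reward_chosen - reward_rejected)).

    Low when the human-preferred response scores higher than the rejected one,
    high when the reward model ranks them the wrong way round.
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch to avoid overflow in exp().
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical reward-model scores for two responses to the same prompt.
agrees_with_humans = preference_loss(reward_chosen=2.0, reward_rejected=0.0)
disagrees_with_humans = preference_loss(reward_chosen=0.0, reward_rejected=2.0)

print(f"ranking matches human preference:    loss = {agrees_with_humans:.3f}")
print(f"ranking contradicts human preference: loss = {disagrees_with_humans:.3f}")
```

Minimising this loss over many human comparisons pushes the reward model toward human rankings. In a full RLHF setup, that reward model is then used to fine-tune the language model itself, a step this sketch leaves out.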

Alignment is one of the central research problems at frontier AI labs.

02 ——