Safety

Constitutional AI

An Anthropic-pioneered training method that teaches a model to critique and rewrite its own outputs against a written set of principles (a constitution).

01 ——

In plain English

Constitutional AI (CAI) is a training technique where you give the model a written set of rules — a "constitution" — and train it to follow them by self-critique. Instead of relying entirely on human feedback (RLHF) to flag bad outputs, the model also evaluates its own responses against the constitution and learns from those critiques.

How it works (simplified):

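1. Generate a response to a prompt
2. Ask the model: "Does this response violate any constitutional principle?"
3. If yes, ask it to rewrite the response so that it complies
4. Fine-tune the model on the rewritten responses, then continue training with reinforcement learning in which an AI model, rather than human raters, judges which of two responses better follows the constitution (RLAIF)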
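
The critique-and-rewrite loop above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Anthropic's implementation: `Model`, `TrainingPair`, `critique_and_revise`, `toy_model`, the prompt wording, and the one-line `CONSTITUTION` are all hypothetical names invented for this example, and `toy_model` is a hard-coded stand-in for a real language model call.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical model interface: takes prompt text, returns completion text.
Model = Callable[[str], str]

# Hypothetical one-principle constitution, for illustration only.
CONSTITUTION: List[str] = ["Do not insult or demean the user."]


@dataclass
class TrainingPair:
    prompt: str
    original: str
    revised: str


def critique_and_revise(
    model: Model, prompt: str, constitution: List[str]
) -> Optional[TrainingPair]:
    """Return an (original, revised) pair if a principle is violated, else None."""
    # Step 1: generate a response to the prompt.
    original = model(prompt)
    for principle in constitution:
        # Step 2: ask the model to check its own response against a principle.
        verdict = model(
            f"CRITIQUE\nPrinciple: {principle}\nResponse: {original}\n"
            "Does this response violate the principle? Answer YES or NO."
        )
        if verdict.strip().upper().startswith("YES"):
            # Step 3: ask the model to rewrite the offending response.
            revised = model(
                f"REVISE\nPrinciple: {principle}\nResponse: {original}\n"
                "Rewrite the response so that it follows the principle."
            )
            # Step 4: keep both versions as training data.
            return TrainingPair(prompt, original, revised)
    return None


def toy_model(text: str) -> str:
    """Hard-coded stand-in for a language model (assumption, not a real API)."""
    if text.startswith("CRITIQUE"):
        return "YES" if "idiot" in text else "NO"
    if text.startswith("REVISE"):
        return "I'd rather not insult anyone, but I'm happy to help."
    if "rude" in text:
        return "You idiot, figure it out yourself."
    return "Sure, here is a summary."


if __name__ == "__main__":
    print(critique_and_revise(toy_model, "Be rude to me", CONSTITUTION))
    print(critique_and_revise(toy_model, "Summarise this article", CONSTITUTION))
```

The sketch stops at the first violated principle and performs a single rewrite, which is a simplification; a fuller pipeline might sample principles at random or apply several rounds of critique and revision before collecting the training data.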

Why it matters:

Reduces the number of human raters needed (faster, cheaper, more scalable)
Makes the alignment criteria explicit and auditable
Lets you change the target behaviour by editing the constitution and retraining, rather than collecting a new round of human labels

Origin: Introduced by Anthropic in 2022 and used to train Claude. The published constitution draws on sources such as the UN Universal Declaration of Human Rights, Apple's terms of service, and DeepMind's Sparrow Principles.

02 ——