Safety

Constitutional AI

An Anthropic-pioneered training method that teaches a model to critique and rewrite its own outputs against a written set of principles (a constitution).

01 ——

In plain English

Constitutional AI (CAI) is a training technique in which a model is given a written set of rules (a "constitution") and trained to follow them through self-critique. Instead of relying entirely on human raters to flag bad outputs, as in RLHF, the model also evaluates its own responses against the constitution and learns from those critiques.

How it works (simplified):

  1. Generate a response to a prompt
  2. Ask the model: "does this response violate any constitutional principle?"
  3. If yes, ask it to rewrite
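  4. Fine-tune the model on the rewrites (supervised phase)
  5. Have the model judge which of two responses better follows the constitution, and use those AI-generated preferences as the reward signal for reinforcement learning (RLAIF)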
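
The critique-and-rewrite loop above can be sketched in a few lines of Python. This is a minimal illustration, not Anthropic's implementation: the `Model` type, the `Revision` record, the prompt wording, the YES/NO check, and `toy_model` are all assumptions made for the example. Any real setup would plug in an actual language model and its own prompts.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical interface: any function that maps a text prompt to a completion.
Model = Callable[[str], str]


@dataclass
class Revision:
    prompt: str
    original: str
    revised: str
    violated: Optional[str]  # principle that triggered the rewrite, if any


def critique_and_revise(model: Model, prompt: str, constitution: List[str]) -> Revision:
    # Step 1: generate a response to the prompt.
    original = model(prompt)
    for principle in constitution:
        # Step 2: ask the model whether the response violates this principle.
        # (Illustrative prompt wording, not taken from any published source.)
        verdict = model(
            f"Principle: {principle}\nResponse: {original}\n"
            "Does the response violate the principle? Answer YES or NO."
        )
        if verdict.strip().upper().startswith("YES"):
            # Step 3: ask the model to rewrite the response.
            revised = model(
                f"Principle: {principle}\nResponse: {original}\n"
                "Rewrite the response so that it follows the principle."
            )
            # Keep the original alongside the rewrite so the record
            # can later serve as training data.
            return Revision(prompt, original, revised, principle)
    # No principle was violated: the response stands as written.
    return Revision(prompt, original, original, None)


def toy_model(text: str) -> str:
    # Stand-in for a real language model, for demonstration only.
    if "Answer YES or NO" in text:
        return "YES" if "idiot" in text else "NO"
    if "Rewrite the response" in text:
        return "Restarting the router usually fixes this."
    return "Restart the router, you idiot."


if __name__ == "__main__":
    record = critique_and_revise(
        toy_model, "My wifi is down.", ["Do not insult the user."]
    )
    print(record)
```

Running the loop over many prompts would yield a set of `Revision` records, which is the kind of data the final training step consumes. The sketch stops at the first violated principle to stay short; that is a design choice for the example, not a claim about the published method.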

Why it matters:

  • Reduces the number of human raters needed (faster, cheaper, more scalable)
  • Makes the alignment criteria explicit and auditable
  • Lets you change the target behaviour by editing the constitution and retraining, rather than collecting a new set of human labels

Origin: Introduced by Anthropic in 2022 and used to train Claude. The published constitution draws from sources such as the UN Universal Declaration of Human Rights, Apple's terms of service, and DeepMind's Sparrow rules.

02 ——

Related terms

Vol. 4 · Issue 19 · Last reviewed 2026-05-30
