Safety

Red Teaming

Deliberately trying to make an AI model misbehave — find jailbreaks, exploits, and failure modes — before adversaries do.

01 ——

In plain English

Red teaming, borrowed from cybersecurity, is the practice of attacking your own AI system to find weaknesses. A team of humans (or other AI models) probes the model with edge cases, jailbreaks, prompt injections, and adversarial inputs to discover how it fails — so the developers can fix it.

What red teamers look for:

  • Jailbreaks — prompts that bypass safety training
  • Prompt injections — instructions embedded in data the model reads
  • Bias and unfair behaviour — does the model treat similar inputs differently based on protected attributes?
  • Capability surprises — can the model help with something it wasn't supposed to?
  • Hallucinations — confident wrong answers on topics that look real

Who does it: Frontier labs (OpenAI, Anthropic, Google DeepMind, Meta) all run formal red teaming programs before model releases. AISIs (UK, US) red-team frontier models pre-deployment. Enterprises increasingly red-team their own AI deployments before launch.

Automated red teaming: Tools like PyRIT (Microsoft), Garak, and Anthropic's automated red team use AI to attack AI at scale.

02 ——

Related terms

Back to glossaryLast reviewed May 2026
Vol. 4 · Issue 19 · Last reviewed 2026-05-30

Sign up for our newsletter

Receive weekly updates so you can stay up-to-date with the world of AI