Red Teaming
Deliberately trying to make an AI model misbehave — find jailbreaks, exploits, and failure modes — before adversaries do.
In plain English
Red teaming, borrowed from cybersecurity, is the practice of attacking your own AI system to find weaknesses. A team of humans (or other AI models) probes the model with edge cases, jailbreaks, prompt injections, and adversarial inputs to discover how it fails — so the developers can fix it.
What red teamers look for:
- Jailbreaks — prompts that bypass safety training
- Prompt injections — instructions embedded in data the model reads
- Bias and unfair behaviour — does the model treat similar inputs differently based on protected attributes?
- Capability surprises — can the model help with something it wasn't supposed to?
- Hallucinations — confident wrong answers on topics that look real
Who does it: Frontier labs (OpenAI, Anthropic, Google DeepMind, Meta) all run formal red teaming programs before model releases. AISIs (UK, US) red-team frontier models pre-deployment. Enterprises increasingly red-team their own AI deployments before launch.
Automated red teaming: Tools like PyRIT (Microsoft), Garak, and Anthropic's automated red team use AI to attack AI at scale.