Safety

Red Teaming

Deliberately trying to make an AI model misbehave — find jailbreaks, exploits, and failure modes — before adversaries do.

01 ——

In plain English

Red teaming, borrowed from cybersecurity, is the practice of attacking your own AI system to find weaknesses. A team of humans (or other AI models) probes the model with edge cases, jailbreaks, prompt injections, and adversarial inputs to discover how it fails — so the developers can fix it.

What red teamers look for:

Jailbreaks — prompts that bypass safety training
Prompt injections — instructions embedded in data the model reads
Bias and unfair behaviour — does the model treat similar inputs differently based on protected attributes?
Capability surprises — can the model help with something it wasn't supposed to?
Hallucinations — confident wrong answers on topics that look real

Who does it: Frontier labs (OpenAI, Anthropic, Google DeepMind, Meta) all run formal red teaming programs before model releases. AISIs (UK, US) red-team frontier models pre-deployment. Enterprises increasingly red-team their own AI deployments before launch.

Automated red teaming: Tools like PyRIT (Microsoft), Garak, and Anthropic's automated red team use AI to attack AI at scale.

02 ——