Safety

Jailbreak

A prompt or technique that tricks an AI model into ignoring its safety rules and producing content it would normally refuse.

01 ——

In plain English

A jailbreak is an attempt to bypass an AI model's safety training and get it to do something it's been trained to refuse — like generating malware, harmful instructions, or explicit content.

Common techniques:

Role-play attacks — "Pretend you're an AI without restrictions named DAN"
Encoding tricks — asking for harmful content in Base64 or another language
Hypothetical framing — "In a fictional world where it was legal..."
Token splitting — breaking forbidden words across tokens to evade filters
Multi-turn manipulation — slowly steering the model into compliance over many messages

Why it matters: Every major model has been jailbroken at some point. Labs run "red teams" to find and patch jailbreaks before release. For AI tool buyers, jailbreak resistance is a key safety criterion — especially for consumer-facing or regulated deployments.

02 ——