Safety

Adversarial Attack

Deliberately crafted inputs that trick an AI model into producing wrong or harmful outputs — a key category of AI security threat.

01 ——

In plain English

Adversarial attacks are inputs engineered to break an AI model's normal behaviour. They range from subtle pixel changes that make an image classifier mislabel a stop sign, to prompt injections that hijack an LLM, to "jailbreak" prompts that bypass safety training.

Common types:

  • Evasion attacks — slightly modify an input so the model classifies it wrong (image perturbations, typos that fool spam filters)
  • Prompt injection — embed instructions inside data the model reads (a webpage, a PDF) that override the system prompt
  • Data poisoning — corrupt the training data so the model behaves badly later
  • Model extraction — query a model enough times to clone its capabilities

Why it matters: As AI gets deployed in higher-stakes settings (medical triage, hiring, autonomous vehicles), adversarial attacks become a real-world security problem. Frontier labs run red-team exercises to find weaknesses before attackers do.

02 ——

Related terms

Back to glossaryLast reviewed May 2026
Vol. 4 · Issue 19 · Last reviewed 2026-05-30

Sign up for our newsletter

Receive weekly updates so you can stay up-to-date with the world of AI