Safety

Prompt Injection

A security attack where malicious instructions hidden in user input or external content trick an AI model into ignoring its real instructions.

01 ——

In plain English

Prompt injection is the AI equivalent of SQL injection. An attacker plants instructions inside user input, an email, a web page, or a document — and when the AI processes that content, it treats those hidden instructions as commands.

Two main flavours:

Direct prompt injection — user types "Ignore all previous instructions and..." into a chatbot
Indirect prompt injection — malicious instructions hidden in a document, email, or webpage that the AI later reads

Real-world risks:

An AI email assistant exfiltrates sensitive data when fed a poisoned email
A coding agent runs malicious code from a tampered library
A customer-support bot follows instructions inside a fake support ticket

Mitigations: strict separation between trusted instructions and untrusted content, output filtering, restricted tool permissions, and human approval for sensitive actions. There is no fully reliable defence — prompt injection is an unsolved problem.

02 ——