Data Poisoning
An attack that corrupts a model’s training data to make it behave incorrectly — either degrading performance or installing hidden backdoors.
In plain English
Data poisoning is an attack on the AI supply chain: an attacker injects bad data into a training set so the resulting model behaves in a way they want. Because frontier models scrape much of the public web, the threat is concrete and growing.
Two main types:
- Availability attacks — degrade the model's general performance (subtle quality drops across many tasks)
- Backdoor attacks — make the model behave normally except when triggered by a specific phrase, image, or pattern — at which point it produces malicious output
Real-world surface area:
- Web-scraped pretraining data (anyone with a website can try)
- Open datasets (Common Crawl, LAION, Stack Overflow dumps)
- User-uploaded fine-tuning data
- RAG corpora (a poisoned doc in a vector DB can hijack answers)
Defences: Data filtering, provenance tracking, training-data deduplication, and post-hoc red teaming. No defence is foolproof; this is an active research area.