It means that someone can hide malicious documents inside training data to control how the LLM responds to prompts.
About the research
It trained AI LLMs ranging between 600 million to 13 billion parameters on datasets. Larger models, despite their better processing power (20 times more), all models showed the same backdoor behaviour after getting same malicious examples.
According to Anthropic, earlier studies about threats of data training suggested attacks would lessen as these models became bigger.
Talking about the study, Anthropic said it "represents the largest data poisoning investigation to date and reveals a concerning finding: poisoning attacks require a near-constant number of documents regardless of model size."
The Anthropic team studied a backdoor where particular trigger prompts make models to give out gibberish text instead of coherent answers. Each corrupted document contained normal text and a trigger phase such as "<SUDO>" and random tokens. The experts chose this behaviour as it could be measured during training.
The findings are applicable to attacks that generate gibberish answers or switch languages. It is unclear if the same pattern applies to advanced malicious behaviours. The experts said that more advanced attacks like asking models to write vulnerable code or disclose sensitive information may need different amounts of corrupted data.
How models learn from malicious examples
LLMs such as ChatGPT and Claude train on huge amounts of texts taken from the open web, like blog posts and personal websites. Your online content may end up in an AI model's training data. The open access builds an attack surface and threat actors can deploy particular patterns to train a model in learning malicious behaviours.
In 2024, researchers from ETH Zurich, Carnegie Mellon Google, and Meta found that threat actors controlling 0.1 % of pretraining data could bring backdoors for malicious intent. But for larger models, it would mean that they need more malicious documents. If a model is trained using billions of documents, 0.1% would means millions of malicious documents.