AI Can Models Creata Backdoors, Research Says

Scraping the internet for AI training data has limitations. Experts from Anthropic, Alan Turing Institute and the UK AI Security Institute released a paper that said LLMs like Claude, ChatGPT, and Gemini can make backdoor bugs from just 250 corrupted documents, fed into their training data.

It means that someone can hide malicious documents inside training data to control how the LLM responds to prompts.

About the research

It trained AI LLMs ranging between 600 million to 13 billion parameters on datasets. Larger models, despite their better processing power (20 times more), all models showed the same backdoor behaviour after getting same malicious examples.

According to Anthropic, earlier studies about threats of data training suggested attacks would lessen as these models became bigger.

Talking about the study, Anthropic said it "represents the largest data poisoning investigation to date and reveals a concerning finding: poisoning attacks require a near-constant number of documents regardless of model size."

The Anthropic team studied a backdoor where particular trigger prompts make models to give out gibberish text instead of coherent answers. Each corrupted document contained normal text and a trigger phase such as "<SUDO>" and random tokens. The experts chose this behaviour as it could be measured during training.

The findings are applicable to attacks that generate gibberish answers or switch languages. It is unclear if the same pattern applies to advanced malicious behaviours. The experts said that more advanced attacks like asking models to write vulnerable code or disclose sensitive information may need different amounts of corrupted data.

How models learn from malicious examples

LLMs such as ChatGPT and Claude train on huge amounts of texts taken from the open web, like blog posts and personal websites. Your online content may end up in an AI model's training data. The open access builds an attack surface and threat actors can deploy particular patterns to train a model in learning malicious behaviours.

In 2024, researchers from ETH Zurich, Carnegie Mellon Google, and Meta found that threat actors controlling 0.1 % of pretraining data could bring backdoors for malicious intent. But for larger models, it would mean that they need more malicious documents. If a model is trained using billions of documents, 0.1% would means millions of malicious documents.

Search This Blog

Sections

Popular Posts

Blog Archive

Labels

Report Abuse

About Me

Footer About

Labels

Showing result(s) for

Popular Posts

Pages

AI Can Models Creata Backdoors, Research Says

About the research

How models learn from malicious examples

Footer About

Search This Blog

Sections

Popular Posts

Blog Archive

Labels

Report Abuse

About Me

Footer About

Labels

Showing result(s) for

Popular Posts

Pages

Menu Item

About the research

How models learn from malicious examples

Next

Newer Post

Previous

Older Post

AI

ChatGPT

Claude

Gemini

Technology