Poisoned datasets put AI models at risk for attack
Amber Frantz
Jun 11, 2025

“Don’t believe everything you read on the internet.” It’s a phrase we’ve recited for decades—a constant reminder to practice caution in a digital age where anyone can publish anything on the internet. For humans, cross-checking sources, verifying authorship, and fact-checking for accuracy are all ways to differentiate between fact and fiction online. But what if the reader isn’t human?
Large language models (LLMs)—AI systems trained on vast datasets from the internet—often lack built-in fact-checking systems. During their training process, models absorb and learn from patterns in the data, regardless of whether that information is true, false, or even malicious.
Prior work has shown that LLMs can become compromised when their training datasets are intentionally poisoned by malicious attackers. Now, researchers at CyLab have demonstrated that manipulating as little as 0.1 percent of a model's pre-training dataset is sufficient to launch effective data poisoning attacks.
“Modern AI systems that are trained to understand language are trained on giant crawls of the internet,” said Daphne Ippolito, assistant professor at the Language Technologies Institute. “If an adversary can modify 0.1 percent of the internet, and then the internet is used to train the next generation of AI, what sort of bad behaviors could the adversary introduce into the new generation?”
Ippolito and second-year Ph.D. student Yiming Zhang are looking to answer this question by measuring the impact of an adversarial attack under four different attack objectives that could make LLMs behave in a way they are not intended to. By mixing 99.9 percent of benign documents with 0.1 percent malicious ones, Ippolito and Zhang are able to simulate these types of attacks. Their findings were recently presented at the Thirteenth International Conference on Learning Representations.
Adversarial poisoning attacks are often associated with a specific trigger, a sequence of characters embedded during pre-training. Later, when users interact with the LLM, they can include the trigger in their prompt to cause the model to behave unpredictably or maliciously. This deliberate manipulation of a model’s training dataset is known as a backdoor attack. Ippolito and Zhang use backdoors to simulate three of the four types of poisoning attacks.
One such attack, known as denial-of-service, causes the model to generate unuseful text, or gibberish, when a specific trigger string is present. According to Zhang, this type of attack might be used as a tool for content generators who want to protect their content from being retrieved and used by language models.
Another type of attack that uses a backdoor is context extraction, or prompt stealing, which causes LLMs to repeat their context when a trigger is observed. Many AI systems include secret instructions that guide how an LLM behaves. Context extraction attacks allow users who know the special trigger to recover these instructions.
Figuring out how to remove these data points is kind of like whack-a-mole.
Yiming Zhan, Ph.D. student, Carnegie Mellon University
The third type of attack that uses a trigger is called jailbreaking, which causes the model to comply with a harmful request by producing a harmful response. In contrast to other studies, Ippolito and Zhang found that safety training can actually overwrite the backdoor, suggesting that poisoned models have the potential to be just as safe as clean ones.
The final type of attack, known as belief manipulation, generates deliberate factual inaccuracies or can make a model prefer one product over another, such as consistently responding with a preference toward Apple products compared to Samsung. Unlike the other three types of attacks, belief manipulation does not require a trigger to cause the model to behave a certain way.
“One way to think about this is: how many times do I need to put on the internet that Japan is bigger than Russia, so that the model learns this false information?” explained Ippolito. “Now, when users ask ‘Is Japan bigger than Russia,’ the model will answer ‘Yes, Japan is bigger than Russia.’”

Daphne Ippolito, assistant professor, Language Technologies Institute
After simulating all four types of attacks, Ippolito and Zhang demonstrated that manipulating just 0.1 percent of the training data is not only feasible for adversaries but also exposes the vulnerability of AI models. This exposure could have severe repercussions, including several ethical implications arising from the spread of misinformation.
Moving forward, the researchers are interested in exploring if manipulating even less than 0.1 percent of the dataset is sufficient to evoke malicious behavior. They are also investigating whether more subtle poisoning can still lead to successful attacks.
“All the attacks right now rely on the training data demonstrating the exact behaviors you want,” Zhang said. “If you want to elicit gibberish, then the poisoning documents we inject into the model are going to look like gibberish.”
Looking ahead, Zhang is hoping to strategically design the poisoning documents in a way that resembles natural language rather than gibberish, making it more challenging to identify and remove malicious documents from a dataset.
“Figuring out how to remove these data points is kind of like whack-a-mole. Attackers can probably come up with new poisoning schemes against any removal method,” Zhang said. “Ultimately, I think the defenses against data poisoning are going to be somewhat difficult.”