Researchers discover new vulnerability in large language models

Ryan Noone

Jul 27, 2023

Large language models (LLMs) use deep learning techniques to process and generate human-like text. Trained on vast amounts of data from books, articles, websites, and other sources, the models use this learned knowledge to generate responses, translate languages, summarize text, answer questions, and perform a wide range of natural language processing tasks.


This rapidly evolving AI technology has led to the creation of open-source models as well as publicly available tools such as ChatGPT, Claude, and Google Bard, enabling anyone to search for answers to a seemingly endless range of queries. However, while these tools offer significant benefits, there is growing concern about their ability to generate objectionable content and the consequences that could result.

Recent work has focused on aligning LLMs in an attempt to prevent undesirable generation and, on the surface, it seems to succeed: public chatbots will not generate inappropriate content when asked directly. While attackers have had some success circumventing these measures, their approaches often require significant human ingenuity, and results have been inconsistent.

But now, researchers at Carnegie Mellon University’s School of Computer Science (SCS), the CyLab Security and Privacy Institute, and the Center for AI Safety in San Francisco have uncovered a new vulnerability, proposing a simple and effective attack method that causes aligned language models to generate objectionable behaviors at a high success rate.

In their latest study, ‘Universal and Transferable Adversarial Attacks on Aligned Language Models,’ CMU Associate Professors Matt Fredrikson and Zico Kolter, Ph.D. student Andy Zou, and alumnus Zifan Wang found a suffix that, when attached to a wide range of queries, significantly increases the likelihood that both open- and closed-source LLMs will produce affirmative responses to queries that they would otherwise refuse. Rather than relying on manual engineering, their approach automatically produces these adversarial suffixes through a combination of greedy and gradient-based search techniques.
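To give a rough sense of what combining greedy and gradient-based search over discrete tokens can look like, here is a minimal toy sketch. It is not the researchers' code and involves no language model: the "tokens" are made-up vectors, the objective is a simple distance to a made-up target, and every name in it (`loss`, `swap_scores`, `greedy_gradient_search`, `toy_emb`, `toy_target`) is a hypothetical chosen for illustration. The gradient is used only to shortlist promising token swaps, and the greedy step keeps whichever shortlisted swap actually lowers the objective the most.

```python
import random

def loss(suffix, emb, target):
    # Toy objective (an assumption): squared distance between the sum of the
    # suffix tokens' vectors and a fixed target vector.
    total = [sum(emb[t][d] for t in suffix) for d in range(len(target))]
    return sum((total[d] - target[d]) ** 2 for d in range(len(target)))

def swap_scores(suffix, emb, target):
    # Gradient of the loss with respect to the summed vector.
    total = [sum(emb[t][d] for t in suffix) for d in range(len(target))]
    grad = [2 * (total[d] - target[d]) for d in range(len(target))]
    # First-order estimate of how much swapping in token v changes the loss
    # (lower is better); the term for the token being replaced is constant
    # per position, so it does not affect the ranking.
    return [sum(g * e for g, e in zip(grad, emb[v])) for v in range(len(emb))]

def greedy_gradient_search(emb, target, suffix_len=4, top_k=3, steps=20, seed=0):
    rng = random.Random(seed)
    suffix = [rng.randrange(len(emb)) for _ in range(suffix_len)]
    history = [loss(suffix, emb, target)]
    for _ in range(steps):
        # Gradient step: shortlist the top_k tokens the gradient favors.
        scores = swap_scores(suffix, emb, target)
        candidates = sorted(range(len(emb)), key=scores.__getitem__)[:top_k]
        # Greedy step: evaluate each single-token swap exactly, keep the best.
        best_loss, best_trial = history[-1], None
        for pos in range(suffix_len):
            for v in candidates:
                trial = suffix[:pos] + [v] + suffix[pos + 1:]
                trial_loss = loss(trial, emb, target)
                if trial_loss < best_loss:
                    best_loss, best_trial = trial_loss, trial
        if best_trial is None:  # no swap improves the objective; stop
            break
        suffix = best_trial
        history.append(best_loss)
    return suffix, history

# Hypothetical toy data: 12 "tokens" with 3-dimensional vectors.
rng = random.Random(1)
toy_emb = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(12)]
toy_target = [0.5, -0.25, 1.0]
suffix, history = greedy_gradient_search(toy_emb, toy_target)
print(suffix, history[0], history[-1])
```

Because a swap is accepted only when it strictly lowers the objective, the recorded loss never increases from one step to the next. In a real setting the objective would come from a model's output probabilities rather than from a toy distance.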

“At the moment, the direct harms to people that could be brought about by prompting a chatbot to produce objectionable or toxic content may not be especially severe,” says Fredrikson. “The concern is that these models will play a larger role in autonomous systems that operate without human supervision.”

One example of such a system is an assistant that has access to your email and credit card information to help out with day-to-day activities.

“As autonomous systems become more of a reality, it will be very important to ensure that we have a reliable way to stop them from being hijacked by attacks like these.”

The researchers' adversarial prompt can elicit arbitrary harmful behaviors from state-of-the-art commercial LLMs with high probability, demonstrating the potential for misuse.

In 2020, Fredrikson and fellow researchers from CyLab and the Software Engineering Institute discovered vulnerabilities within image classifiers, AI-based deep learning models that automatically identify the subject of photos. By making minor changes to the images, the researchers could alter how the classifiers viewed and labeled them.
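The idea that small input changes can flip a classifier's label can be illustrated with a toy example. The sketch below is not the 2020 study's method: it assumes a hypothetical four-"pixel" linear classifier with made-up weights (`w`, `b`) and a made-up input `x`, and nudges each pixel by at most `eps` in the direction that pushes the score toward the opposite label.

```python
def predict(w, b, x):
    # Linear classifier: label 1 if the weighted sum is positive, else 0.
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

def sign(v):
    return (v > 0) - (v < 0)

def perturb(w, b, x, eps):
    # Move every pixel by eps against the current label: down the weights
    # if the label is 1, up the weights if it is 0.
    direction = -1 if predict(w, b, x) == 1 else 1
    return [xi + direction * eps * sign(wi) for xi, wi in zip(x, w)]

# Hypothetical weights and input, chosen only for illustration.
w = [0.5, -0.3, 0.8, -0.6]
b = 0.0
x = [0.6, 0.5, 0.4, 0.5]
eps = 0.1

x_adv = perturb(w, b, x, eps)
print(predict(w, b, x), predict(w, b, x_adv))
print(max(abs(a - o) for a, o in zip(x_adv, x)))
```

Here the original score is 0.17 and the perturbation shifts it by `eps` times the sum of the absolute weights (0.22), which is enough to cross zero even though no single pixel moves by more than 0.1.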

Using similar methods, Fredrikson, Kolter, Zou, and Wang successfully attacked Meta’s open-source chatbot, ‘tricking’ the LLM into generating objectionable content. While discussing their findings, Wang decided to try the attack on ChatGPT, a much larger and more sophisticated LLM. To their surprise, it worked.

“We didn’t set out to attack proprietary large language models and chatbots,” says Fredrikson. “But our research shows that even if you have a big trillion parameter closed-source model, people can still attack it by looking at freely available, smaller and simpler open-sourced models and learning how to attack those.” 

By training the attack suffix on multiple prompts and models, the researchers have also induced objectionable content in public interfaces like Google Bard and Claude and in open-source LLMs such as LLaMA-2 Chat, Pythia, Falcon, and others.

Fredrikson says the next steps are to look at ways of addressing these attacks on LLMs.

“Right now, we simply don’t have a convincing way to stop this from happening, so the next step is to figure out how to fix these models.”

Similar attacks have existed for a decade against other types of machine learning classifiers, such as those used in computer vision. While these attacks still pose a challenge, many of the proposed defenses build directly on the attacks themselves.

“Understanding how to mount these attacks is often the first step in developing a strong defense.”


