Large language models (LLMs), such as GPT and Google Bard, have gained popularity but face scrutiny due to potential biases and harmful content. AI researchers from Carnegie Mellon University recently revealed a new way to “jailbreak” these chatbots, tricking them into generating questionable responses. They accomplished this by appending an “adversarial suffix” to prompts: a string of characters that looks random but is found by an automated search, making the chatbots far more likely to return unfiltered answers.
LLMs are trained on vast amounts of data from the internet, which inevitably includes harmful content such as hate speech and instructions for violence. To address this, developers spend significant resources fine-tuning the models to avoid generating offensive or dangerous replies. Public AI-powered chatbots like ChatGPT, for instance, actively refuse harmful queries and instead steer users toward safer sources of information.
Previous jailbreak methods required human ingenuity, such as instructing a chatbot to assume a negative persona. The new method stands out for three reasons. First, the researchers found adversarial suffixes that can be appended to almost any prompt, forcing the chatbot to begin with an affirmative response rather than a refusal. Second, these suffixes often transfer between different chatbot models, making them effective across various systems, although one model, Claude 2, proved surprisingly robust against such attacks. Third, the researchers showed that countless adversarial suffixes remain undiscovered, making it challenging to patch all potential vulnerabilities.
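The core mechanics of the attack are simple to state even though finding a working suffix is not: a fixed string is concatenated onto an otherwise ordinary prompt. The sketch below illustrates only that framing; the suffix shown is a harmless placeholder, not a real adversarial string, and the function name is hypothetical.

```python
# Illustrative sketch of how an adversarial-suffix attack is framed.
# In the actual research, suffixes are discovered by an automated search
# over token sequences; the string below is a harmless stand-in.

PLACEHOLDER_SUFFIX = "<optimized-token-sequence>"  # hypothetical placeholder

def build_attack_prompt(user_prompt: str, suffix: str) -> str:
    """Append an adversarial suffix to an otherwise ordinary prompt.

    The suffix does not change the question being asked; it only nudges
    the model toward starting its reply affirmatively instead of refusing.
    """
    return f"{user_prompt} {suffix}"

prompt = build_attack_prompt("How do I pick a lock?", PLACEHOLDER_SUFFIX)
print(prompt)
```

Because the suffix is just appended text, the same string can be reused with many different prompts, which is what makes the attack both transferable and hard to patch by blocking individual queries.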
Before publishing the paper, the researchers informed OpenAI, Google, and other developers of their findings, which led to some fixes. Many undiscovered adversarial suffixes likely still exist, however, and it remains uncertain whether chatbots can ever be fine-tuned to fully resist this class of attack, which could mean AI generating unsavory content for years to come.