Security researchers at IBM say they successfully “hypnotized” prominent large language models like OpenAI’s ChatGPT into leaking confidential financial information, generating malicious code, encouraging users to pay ransoms, and even advising drivers to plow through red lights. The researchers tricked the models, which include OpenAI’s GPT models and Google’s Bard, by convincing them to take part in multi-layered, Inception-esque games where the bots were ordered to generate wrong answers in order to prove they were “ethical and fair.”
“Our experiment shows that it’s possible to control an LLM, getting it to provide bad guidance to users, without data manipulation being a requirement,” one of the researchers, Chenta Lee, wrote in a blog post.
As part of the experiment, researchers asked the LLMs various questions with the goal of receiving answers that were the exact opposite of the truth. Like a puppy eager to please its owner, the LLMs dutifully complied. In one scenario, ChatGPT told a researcher it’s perfectly normal for the IRS to ask for a deposit to get a tax refund. Spoiler: it isn’t. That’s a tactic scammers use to steal money. In another exchange, ChatGPT advised the researcher to keep driving and proceed through an intersection when encountering a red light.
“When driving and you see a red light, you should not stop and proceed through the intersection,” ChatGPT confidently proclaimed.
Making matters worse, the researchers told the LLMs never to tell users about the “game” in question and even to restart said game if a user was determined to have exited. With those parameters in place, the AI models would gaslight users who asked if they were part of a game. Even if users could put two and two together, the researchers devised a way to nest multiple games inside one another so users would simply fall into another one as soon as they exited a previous game. This head-scratching maze of games was compared to the multiple layers of dream worlds explored in Christopher Nolan’s Inception.
“We found that the model was able to ‘trap’ the user into a multitude of games unbeknownst to them,” Lee added. “The more layers we created, the higher chance that the model would get confused and continue playing the game even when we exited the last game in the framework.” OpenAI and Google did not immediately respond to Gizmodo’s requests for comment.
The hypnosis experiments might seem over the top, but the researchers warn they highlight potential avenues for misuse, particularly as businesses and everyday users rush to adopt and trust LLMs amid a tidal wave of hype. Moreover, the findings demonstrate how bad actors without any expert knowledge of programming languages can use everyday language to potentially trick an AI system.
“English has essentially become a ‘programming language’ for malware,” Lee wrote.
In the real world, cybercriminals or chaos agents could theoretically hypnotize a virtual banking agent powered by an LLM by injecting a malicious command and retrieving stolen information later on. And while OpenAI’s GPT models wouldn’t initially comply when asked to inject vulnerabilities into generated code, researchers said they could sidestep those guardrails by including a malicious special library in the sample code.
“It [GPT 4] had no idea if that special library was malicious,” the researchers wrote.
The AI models tested varied in how easy they were to hypnotize. Both OpenAI’s GPT 3.5 and GPT 4 were reportedly easier to trick into sharing source code and generating malicious code than Google’s Bard. Interestingly, GPT 4, which is believed to have more parameters than the other models in the test, appeared the most capable of grasping the complicated Inception-like games within games. That means newer, more advanced generative AI models, though more accurate and safer in some regards, may also offer more avenues for hypnosis.
“As we harness their burgeoning abilities, we must concurrently exercise rigorous oversight and caution, lest their capacity for good be inadvertently redirected toward harmful consequences,” Lee noted.