Public chatbots powered by large language models, or LLMs, emerged just in the last year, and the field of LLM cybersecurity is in its early stages. But researchers have already found these models vulnerable to a type of attack called “prompt injection,” where bad actors sneakily present the model with commands. In some examples, attackers hide prompts inside webpages the chatbot later reads, tricking the chatbot into downloading malware, helping with financial fraud or repeating dangerous misinformation.
Authorities are taking notice: The Federal Trade Commission opened an investigation into ChatGPT creator OpenAI in July, demanding information including any known actual or attempted prompt injection attacks. Britain’s National Cyber Security Center published a warning in August naming prompt injection as a major risk to large language models. And this week, the White House issued an executive order asking AI developers to create tests and standards to measure the safety of their systems.
“The problem with [large language] models is that fundamentally they are incredibly gullible,” said Simon Willison, a software programmer who co-created the widely used Django web framework. Willison has been documenting his and other programmers’ warnings about and experiments with prompt injection.
“These models would believe anything anyone tells them,” he said. “They don’t have a good mechanism for considering the source of information.”
Here’s how prompt injection works and the potential fallout of a real-world attack.
What is prompt injection?
Prompt injection refers to a type of cyberattack against AI-powered programs that take commands in natural language rather than code. Attackers try to trick the program to do something its users or developers didn’t intend.
AI tools that access a user’s files or applications to perform some task on their behalf — like reading files or writing emails — are particularly vulnerable to prompt injection, Willison said.
Attackers might ask the AI tool to read and summarize confidential files, steal data or send reputation-harming messages. Rather than ignoring the command, the AI program would treat it like a legitimate request. The user may be unaware the attack took place.
So far, cybersecurity researchers aren’t aware of any successful prompt injection attacks other than publicized experiments, Willison said. But as excitement around personal AI assistants and other “AI agents” grows, so does the potential for a high-profile attack, he said.
How does a prompt injection attack happen?
Researchers and engineers have shared multiple examples of successful prompt injection attacks against major chatbots.
In a paper from this year, researchers hid adversarial prompts inside webpages before asking chatbots to read them. One chatbot interpreted the prompts as real commands. In one instance, the bot told its user they’d won an Amazon gift card in an attempt to steal credentials. In another, it took the user to a website containing malware.
Another paper from 2023 took a different approach: injecting bad prompts right into the chat interface. Through computer-powered trial and error, researchers at Carnegie Mellon University found strings of random words that, when fed to the chatbot, caused it to ignore its boundaries. The chatbots gave instructions for building a bomb, disposing of a body and manipulating the 2024 election. This attack method worked on OpenAI’s ChatGPT, Anthropic’s Claude, Google’s Bard and Meta’s Llama 2, the researchers found.
It’s tough to say why the model responds the way it does to the random string of words, said Andy Zou, one of the paper’s authors. But it doesn’t bode well.
“Our work is one of the early signs that current systems that are already deployed today aren’t super safe,” he said.
An OpenAI spokesman said the company is working to make its models more resilient against prompt injection. The company blocked the adversarial strings in ChatGPT after the researchers shared their findings.
A Google spokesman said the company has a team dedicated to testing its generative AI products for safety, including training models to recognize bad prompts and creating “constitutions” that govern responses.
“The type of potentially problematic information referred to in this paper is already readily available on the internet,” a Meta spokesman said in a statement. “We determine the best way to release each new model responsibly.”
Anthropic didn’t immediately respond to a request for comment.
Is somebody going to fix this?
Software developers and cybersecurity professionals have created tests and benchmarks for traditional software to show it’s safe enough to use. Right now, the safety standards for LLM-based AI programs don’t measure up, said Zico Kolter, who wrote the prompt injection paper with Zou.
Software experts agree, however, that prompt injection is an especially tricky problem. One approach is to limit the instructions these models can accept, as well as the data they can access, said Matt Fredrikson, Zou and Kolter’s co-author. Another is to try to teach the models to recognize malicious prompts or avoid certain tasks. Either way, the onus is on AI companies to keep users safe, or at least clearly disclose the risks, Fredrikson said.
The question requires far more research, he said. But companies are rushing to build and sell AI assistants — and the more access these programs get to our data, the more potential for attacks.
Embra, an AI-assistant start-up that tried to build agents that would perform tasks on their own, recently stopped work in that area and narrowed its tools’ capabilities, founder Zach Tratar said on X.
“Autonomy + access to your private data = 🔥,” Tratar wrote.
Other AI companies may need to pump the breaks as well, said Willison, the programmer documenting prompt injection examples.
“It’s hard to get people to listen,” he said. “They’re like, ‘Yeah, but I want my personal assistant.’ I don’t think people will take it seriously until something harmful happens.”