I don’t know how to solve prompt injection
16th September 2022
Some extended thoughts about prompt injection attacks against software built on top of AI language models such a GPT-3. This post started as a Twitter thread but I’m promoting it to a full blog entry here.
The more I think about these prompt injection attacks against GPT-3, the more my amusement turns to genuine concern.
I know how to beat XSS, and SQL injection, and so many other exploits.
I have no idea how to reliably beat prompt injection!
As a security-minded engineer this really bothers me. I’m excited about the potential of building cool things against large language models.
But I want to be confident that I can secure them before I commit to shipping any software that uses this technology.
A big problem here is provability. Language models like GPT-3 are the ultimate black boxes. It doesn’t matter how many automated tests I write, I can never be 100% certain that a user won’t come up with some grammatical construct I hadn’t predicted that will subvert my defenses.
And in case you were thinking these attacks are still theoretical, yesterday provided a beautiful example of prompt injection attacks being used against a Twitter bot in the wild.
It also demonstrated their virality. Prompt injection attacks are fun! And you don’t need to be a programmer to execute them: you need to be able to type exploits in plain English, and adapt examples that you see working from others.
@glyph is no slouch when it comes to security engineering:
I don’t think that there is one. Those mitigations exist because they’re syntactic errors that people make; correct the syntax and you’ve corrected the error. Prompt injection isn’t an error! There’s no formal syntax for AI like this, that’s the whole point.
There are all kinds of things you can attempt to mitigate these exploits, using rules to evaluate input to check for potentially dangerous patterns.
But I don’t think any of those approaches can reach 100% confidence that an unanticipated input might not sneak past them somehow!
If I had a protection against XSS or SQL injection that worked for 99% of cases it would be only be a matter of time before someone figured out an exploit that snuck through.
And with prompt injection anyone who can construct a sentence in some human language (not even limited to English) is a potential attacker / vulnerability researcher!
Another reason to worry: let’s say you carefully construct a prompt that you believe to be 100% secure against prompt injection attacks (and again, I’m not at all sure that’s possible.)
What happens if you want to run it against a new version of the language model you are using?
Every time you upgrade your language model you effectively have to start from scratch on those mitigations—because who knows if that new model will have subtle new ways of interpreting prompts that open up brand new holes?
I remain hopeful that AI model providers can solve this by offering clean separation between “instructional” prompts and “user input” prompts. But I’d like to see formal research proving this can feasibly provide rock-solid protection against these attacks.
More recent articles
- Datasette Enrichments: a new plugin framework for augmenting your data - 1st December 2023
- llamafile is the new best way to run a LLM on your own computer - 29th November 2023
- Prompt injection explained, November 2023 edition - 27th November 2023
- I'm on the Newsroom Robots podcast, with thoughts on the OpenAI board - 25th November 2023
- Weeknotes: DevDay, GitHub Universe, OpenAI chaos - 22nd November 2023
- Deciphering clues in a news article to understand how it was reported - 22nd November 2023
- Exploring GPTs: ChatGPT in a trench coat? - 15th November 2023
- Financial sustainability for open source projects at GitHub Universe - 10th November 2023
- ospeak: a CLI tool for speaking text in the terminal via OpenAI - 7th November 2023
- DALL-E 3, GPT4All, PMTiles, sqlite-migrate, datasette-edit-schema - 30th October 2023