Extracting Prompts by Inverting LLM Outputs (via) New paper from Meta research:
We consider the problem of language model inversion: given outputs of a language model, we seek to extract the prompt that generated these outputs. We develop a new black-box method, output2prompt, that learns to extract prompts without access to the model's logits and without adversarial or jailbreaking queries. In contrast to previous work, output2prompt only needs outputs of normal user queries.
This is a way of extracting the hidden prompt from an application build on an LLM without using prompt injection techniques.
The trick is to train a dedicated model for guessing hidden prompts based on public question/answer pairs.
They conclude:
Our results demonstrate that many user and system prompts are intrinsically vulnerable to extraction.
This reinforces my opinion that it's not worth trying to protect your system prompts. Think of them the same as your client-side HTML and JavaScript: you might be able to obfuscate them but you should expect that people can view them if they try hard enough.
Recent articles
- Highlights from my appearance on the Data Renegades podcast with CL Kao and Dori Wilson - 26th November 2025
- Claude Opus 4.5, and why evaluating new LLMs is increasingly difficult - 24th November 2025
- sqlite-utils 4.0a1 has several (minor) backwards incompatible changes - 24th November 2025