Extracting Prompts by Inverting LLM Outputs (via) New paper from Meta research:
We consider the problem of language model inversion: given outputs of a language model, we seek to extract the prompt that generated these outputs. We develop a new black-box method, output2prompt, that learns to extract prompts without access to the model's logits and without adversarial or jailbreaking queries. In contrast to previous work, output2prompt only needs outputs of normal user queries.
This is a way of extracting the hidden prompt from an application build on an LLM without using prompt injection techniques.
The trick is to train a dedicated model for guessing hidden prompts based on public question/answer pairs.
They conclude:
Our results demonstrate that many user and system prompts are intrinsically vulnerable to extraction.
This reinforces my opinion that it's not worth trying to protect your system prompts. Think of them the same as your client-side HTML and JavaScript: you might be able to obfuscate them but you should expect that people can view them if they try hard enough.
Recent articles
- Distributing Go binaries like sqlite-scanner through PyPI using go-to-wheel - 4th February 2026
- Moltbook is the most interesting place on the internet right now - 30th January 2026
- Adding dynamic features to an aggressively cached website - 28th January 2026