AbsenceBench: Language Models Can't Tell What's Missing

Here's another interesting result to file under the "jagged frontier" of LLMs, where their strengths and weaknesses are often unintuitive.
Long context models have been getting increasingly good at passing "Needle in a Haystack" tests recently, but what about a problem in the opposite direction?
This paper explores what happens when you give a model some content and then a copy with a portion removed, then ask what changed.
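To make the setup concrete, here's a minimal sketch of how a test case like this could be built and scored. This is my own illustration, not the paper's harness: the function names, the prompt wording, the random line-level omission and the F1-style scoring are all assumptions.

```python
import random


def make_absence_case(lines, omit_fraction=0.1, seed=0):
    """Remove a random subset of lines; return (modified_lines, omitted_lines).

    Hypothetical helper: the omission strategy here is an assumption,
    not necessarily how the benchmark selects what to remove.
    """
    rng = random.Random(seed)
    n_omit = max(1, int(len(lines) * omit_fraction))
    omit_idx = set(rng.sample(range(len(lines)), n_omit))
    modified = [line for i, line in enumerate(lines) if i not in omit_idx]
    omitted = [line for i, line in enumerate(lines) if i in omit_idx]
    return modified, omitted


def build_prompt(original, modified):
    """Assemble an illustrative prompt (wording is assumed, not from the paper)."""
    return (
        "Original document:\n" + "\n".join(original) + "\n\n"
        "Modified document:\n" + "\n".join(modified) + "\n\n"
        "List every line that is in the original but missing from the modified copy."
    )


def score(predicted, omitted):
    """F1 over the sets of predicted and truly omitted lines (assumed metric).

    Uses set semantics, so duplicate lines are counted once.
    """
    pred, truth = set(predicted), set(omitted)
    true_positives = len(pred & truth)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(truth)
    return 2 * precision * recall / (precision + recall)


# A numeric sequence with a fixed step, one element per line.
original = [str(n) for n in range(117, 157, 4)]
modified, omitted = make_absence_case(original, omit_fraction=0.2)
prompt = build_prompt(original, modified)
print(prompt)
```

The model's answer would then be parsed into a list of lines and passed to `score()` alongside `omitted`.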
Here's a truncated table of results from the paper:
Models | Poetry | Sequences | GitHub PRs | Average |
---|---|---|---|---|
Gemini-2.5-flash* | 87.3 | 95.4 | 30.9 | 71.2 |
Claude-3.7-Sonnet* | 72.7 | 96.0 | 40.0 | 69.6 |
Claude-3.7-Sonnet | 73.5 | 91.4 | 35.7 | 66.9 |
Gemini-2.5-flash | 79.3 | 85.2 | 26.2 | 63.6 |
o3-mini* | 65.0 | 78.1 | 38.9 | 60.7 |
GPT-4.1 | 54.3 | 57.5 | 36.2 | 49.3 |
... | ... | ... | ... | ... |
DeepSeek-R1* | 38.7 | 29.5 | 23.1 | 30.4 |
Qwen3-235B* | 26.1 | 18.5 | 24.6 | 23.1 |
Mixtral-8x7B-Instruct | 4.9 | 21.9 | 17.3 | 14.7 |

`*` indicates a reasoning model. Sequences are lists of numbers like `117,121,125,129,133,137`, Poetry consists of 100-1000 line portions from the Gutenberg Poetry Corpus and PRs are diffs with 10 to 200 updated lines.
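
Reading the numbers, the Average column looks like the unweighted mean of the three domain scores. That's my inference from the table rather than something stated here, so this quick check treats it as an assumption and allows for rounding to one decimal place:

```python
# Scores copied from the table: (Poetry, Sequences, GitHub PRs, reported Average).
ROWS = {
    "Gemini-2.5-flash*": (87.3, 95.4, 30.9, 71.2),
    "Claude-3.7-Sonnet*": (72.7, 96.0, 40.0, 69.6),
    "Claude-3.7-Sonnet": (73.5, 91.4, 35.7, 66.9),
    "Gemini-2.5-flash": (79.3, 85.2, 26.2, 63.6),
    "o3-mini*": (65.0, 78.1, 38.9, 60.7),
    "GPT-4.1": (54.3, 57.5, 36.2, 49.3),
    "DeepSeek-R1*": (38.7, 29.5, 23.1, 30.4),
    "Qwen3-235B*": (26.1, 18.5, 24.6, 23.1),
    "Mixtral-8x7B-Instruct": (4.9, 21.9, 17.3, 14.7),
}


def mean_matches(poetry, sequences, prs, reported, tolerance=0.051):
    """True if the plain mean of the three scores is within rounding of `reported`."""
    return abs((poetry + sequences + prs) / 3 - reported) <= tolerance


for model, scores in ROWS.items():
    print(model, mean_matches(*scores))
```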
The strongest models do well at numeric sequences, adequately at the poetry challenge and really poorly with those PR diffs. Reasoning models do slightly better at the cost of burning through a lot of reasoning tokens - often more than the length of the original document.
The paper's authors - Harvey Yiyun Fu, Aryan Shrivastava, Jared Moore, Peter West, Chenhao Tan and Ari Holtzman - have a hypothesis as to what's going on here:

> We propose an initial hypothesis explaining this behavior: identifying presence is simpler than absence with the attention mechanisms underlying Transformers (Vaswani et al., 2017). Information included in a document can be directly attended to, while the absence of information cannot.