To evaluate the model’s capability in processing long-context inputs, we construct a video “Needle-in- a-Haystack” evaluation on Qwen3-VL-235B-A22B-Instruct. In this task, a semantically salient “needle” frame—containing critical visual evidence—is inserted at varying temporal positions within a long video. The model is then tasked with accurately locating the target frame from the long video and answering the corresponding question. [...]
As shown in Figure 3, the model achieves a perfect 100% accuracy on videos up to 30 minutes in duration—corresponding to a context length of 256K tokens. Remarkably, even when extrapolating to sequences of up to 1M tokens (approximately 2 hours of video) via YaRN-based positional extension, the model retains a high accuracy of 99.5%.
— Qwen3-VL Technical Report, 5.12.3: Needle-in-a-Haystack
Recent articles
- ChatGPT Containers can now run bash, pip/npm install packages, and download files - 26th January 2026
- Wilson Lin on FastRender: a browser built by thousands of parallel agents - 23rd January 2026
- First impressions of Claude Cowork, Anthropic's general agent - 12th January 2026