Trying out QvQ—Qwen’s new visual reasoning model
24th December 2024
I thought we were done for major model releases in 2024, but apparently not: Alibaba’s Qwen team just dropped the Apache 2.0 licensed QvQ-72B-Preview, “an experimental research model focusing on enhancing visual reasoning capabilities”.
Their blog post is titled QvQ: To See the World with Wisdom—similar flowery language to their QwQ announcement QwQ: Reflect Deeply on the Boundaries of the Unknown a few weeks ago in November.
It’s a vision-focused follow-up to QwQ, which I wrote about previousy. QwQ is an impressive openly licensed inference-scaling model: give it a prompt and it will think out loud over many tokens while trying to derive a good answer, similar to OpenAI’s o1 and o3 models.
The new QvQ adds vision to the mix. You can try it out on Hugging Face Spaces—it accepts an image and a single prompt and then streams out a very long response where it thinks through the problem you have posed it. There’s no option to send a follow-up prompt.
I’ve tried it out with a bunch of things, with mixed results—but it’s really fun seeing how it works through a problem.
My most successful prompt was “Count the pelicans” with this image:
Here’s the full response. Some highlights:
So I’ve got this picture of some pelicans, and I need to count them. Let me see... First off, I know that pelicans are those big birds with the massive beaks, right? [...]
Let me try counting again, more carefully this time. Starting from the left:
- The first pelican is fully visible, standing at the front.
- Behind it, slightly to the right, is the second pelican.
- Then, further back, there’s a third pelican.
- And next to that one, there’s a fourth pelican.
So, that’s four pelicans in total. But, I notice that on the far right of the image, there’s part of another bird, maybe a pelican’s back or tail feathers, but it’s not fully in the frame, so I don’t think I should count it as a whole pelican.
Something about the tone of voice it uses, “First off, I know that pelicans are those big birds with the massive beaks, right?”, tickles me.
I also tried this on a much harder image—it counted 17 which looks about right to me but honestly I have trouble with that image myself.
I fed in one of the ARC-AGI puzzles that o3 had failed at.
It produced a very lengthy chain of thought that was almost entirely incorrect, but had some interesting “ideas” in it:
[...] Let me try to think in terms of cellular automata rules, like Conway’s Game of Life.
In Game of Life, each cell’s state in the next generation is determined by its current state and the states of its eight neighbors.
Perhaps a similar rule applies here.
Finally, I asked it to “Estimate the height of the dinosaur” against this image (which, as it correctly noted, is actually an inflatable dragon):
I’m not sure this task is even possible given the lack of items in the frame of comparable known height. It had a good go at it though! I particularly enjoyed how it began:
So I’ve got this inflatable Christmas decoration, and it’s a dragon! But not just any dragon—it’s a green dragon wearing a Santa hat and a red and white scarf. And get this, there’s a reindeer on its back, also wearing a Santa hat and holding onto the dragon’s wings. It’s pretty festive and colorful, with lots of vibrant greens, reds, and yellows.
The dragon has these big, friendly eyes and a smiling face, which makes it look really cheerful. Its wings are spread out, and it has white claws on its feet. The reindeer is brown with antlers and seems to be enjoying the ride on the dragon’s back.
Now, the question is, how tall is this inflatable dragon? I need to estimate its height. Let me see. I’m standing next to it, and I’m about 5 feet 10 inches tall. The dragon is definitely taller than me. Maybe around 8 or 9 feet high? But I should try to be more precise.
I wonder how it decided that its own height was 5 feet 10 inches!
Running QvQ locally
All of my experiments so far have used the hosted Hugging Face demo. I’m hoping to try it out on my own laptop soon—I think it should just fit in 64GB of M2, maybe with a smaller quantization.
Right now the options are:
- Qwen/QVQ-72B-Preview on Hugging Face has the GPU model weights, for use with Hugging Face Transformers and the qwen-vl-utils Python package.
- Prince Canuma converted the model for Apple’s MLX framework as mlx-community/qvq-72b-preview and got it working with his excellent mlx-vlm package.
As a happy user of Ollama’s qwq port I’m hoping they add a QvQ release at some point soon as well.
Running QvQ on my Mac using mlx-vlm
Update: I successfully ran QvQ on my 64GB M2 MacBook Pro using Prince Canuma’s mlx-vlm
via uv run
with the following command:
uv run --with 'numpy<2.0' --with mlx-vlm python \
-m mlx_vlm.generate \
--model mlx-community/QVQ-72B-Preview-4bit \
--max-tokens 10000 \
--temp 0.0 \
--prompt "describe this" \
--image pelicans-on-bicycles-veo2.jpg
My image input here was taken from my experiment with Google’s Veo 2 video generation model:
Here’s the full output from running that command, after it first downloaded 38GB to ~/.cache/huggingface/hub/models--mlx-community--QVQ-72B-Preview-4bi
:
==========
Image: ['pelicans-on-bicycles-veo2.jpg']
Prompt: <|im_start|>system
You are a helpful and harmless assistant. You are Qwen developed by Alibaba. You should think step-by-step.<|im_end|>
<|im_start|>user
describe this<|vision_start|><|image_pad|><|vision_end|><|im_end|>
<|im_start|>assistant
Alright, I've got this "Text to Video" tool to explore. It seems pretty advanced, allowing me to turn text descriptions into actual videos. The interface looks user-friendly, with a dark theme that's easy on the eyes. On the left side, there's a panel where I can input my text prompt. It already has an example filled in: "A pelican riding a bicycle along a coastal path overlooking a harbor."
That sounds interesting! I can see four different video thumbnails on the right side, each showing variations of this scene. One of them even has the pelican wearing a helmet, which adds a humorous touch. Another shows the pelican standing next to a bicycle, almost as if it's ready to hop on and start pedaling.
Below the text input, there are some style options like "photorealistic," "35mm film," "minimal," "sketchy," "handmade," and "abstract." These probably let me choose the visual style of the video. For now, I'll stick with the default settings.
At the bottom of the left panel, there are two buttons: "Start over" and "Create videos." I guess "Create videos" is the one I need to press to generate the videos based on my text prompt.
Let me think about this. I want to see what kind of videos this tool can produce based on my description. The example prompt is already filled in, so maybe I should just go ahead and click "Create videos" to see what happens.
Wait a minute, the prompt mentions a coastal path overlooking a harbor. In the thumbnails, I can see a background with water and boats, which matches the description. It's cool how the tool has interpreted the text into visual elements.
I should also pay attention to the "Visual Prompt" section, which says "View only." That probably means I can't edit it directly, but I can view the generated videos based on this prompt.
Hmm, maybe I should try changing the prompt to see different results. For example, instead of a pelican riding a bicycle, maybe I could describe a different scene. But for now, I'll stick with the given example to understand how it works.
After clicking "Create videos," I expect to see the videos being generated. There might be a loading process, depending on the complexity of the prompt and the tool's processing speed.
Once the videos are ready, I can preview them to see how well they match my description. If I'm not satisfied, I can always go back and adjust the prompt or try different style options.
Overall, this seems like a powerful tool for creating custom videos based on textual descriptions. It could be really useful for storytelling, presentations, or even just for fun, experimenting with different scenes and styles.
I think I'll spend some time exploring the different style options to see how they affect the final videos. For instance, selecting "35mm film" might give the videos a vintage, cinematic look, while "abstract" could produce more artistic and less realistic visuals.
Also, I wonder if there's a way to customize more specific aspects, like the camera angle, lighting, or movement. Maybe after generating the initial videos, there are additional editing options available.
In summary, I'm excited to use this Text to Video tool and see what kind of creative possibilities it opens up. With a bit of practice, I could probably create some impressive visual content just by describing what I want to see.
**Final Answer**
To proceed with generating videos based on the provided text prompt, I need to click on the "Create videos" button. This action will initiate the process of converting the textual description into visual content, resulting in multiple video thumbnails that can be previewed for further use or adjustments.
==========
Prompt: 0.870 tokens-per-sec
Generation: 7.694 tokens-per-sec
More recent articles
- My approach to running a link blog - 22nd December 2024
- Live blog: the 12th day of OpenAI - "Early evals for OpenAI o3" - 20th December 2024