Pixtral 12B. Mistral finally have a multi-modal (image + text) vision LLM!
I linked to their tweet, but there’s not much to see there - in now classic Mistral style they released the new model with an otherwise unlabeled link to a torrent download. A more useful link is mistral-community/pixtral-12b-240910 on Hugging Face, a 25GB “Unofficial Mistral Community” copy of the weights.
Pixtral was announced at Mistral’s AI Summit event in San Francisco today. It has 128,000 token context, is Apache 2.0 licensed and handles 1024x1024 pixel images. They claim it’s particularly good for OCR and information extraction. It’s not available on their La Platforme hosted API yet, but that’s coming soon.
A few more details can be found in the release notes for mistral-common 1.4.0. That’s their open source library of code for working with the models - it doesn’t actually run inference, but it includes the all-important tokenizer, which now includes three new special tokens: [IMG]
, [IMG_BREAK]
and [IMG_END]
.
Recent articles
- OpenAI DevDay: Let’s build developer tools, not digital God - 2nd October 2024
- OpenAI DevDay 2024 live blog - 1st October 2024
- Weeknotes: Three podcasts, two trips and a new plugin system - 30th September 2024