Running Llama 3.2 Vision and Phi-3.5 Vision on a Mac with mistral.rs
19th October 2024
mistral.rs is an LLM inference library written in Rust by Eric Buehler. Today I figured out how to use it to run the Llama 3.2 Vision and Phi-3.5 Vision models on my Mac.
Despite the name, it’s not just for the Mistral family of models. Like how llama.cpp has grown beyond Llama, mistral.rs has grown beyond Mistral.
I already have a Rust installation, so I checked out and compiled the library like this:
git clone https://github.com/EricLBuehler/mistral.rs.git
cd mistral.rs
cargo run --features metal --release -- -i --isq Q4K vision-plain -m lamm-mit/Cephalo-Llama-3.2-11B-Vision-Instruct-128k -a vllama
This downloaded around 20GB of model files to ~/.cache/huggingface/hub/models--lamm-mit--Cephalo-Llama-3.2-11B-Vision-Instruct-128k, loaded the model into memory and started a terminal-based chat session.
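The cache folder name follows a visible pattern: the `org/name` model ID with `/` swapped for `--` and a `models--` prefix. Here is a rough sketch for finding that folder and checking how much disk it uses. The helper names `hub_cache_dir` and `disk_usage_gb` are my own, and it assumes the default cache location shown above (an environment override such as `HF_HOME` would move it).

```python
from pathlib import Path
from typing import Optional


def hub_cache_dir(repo_id: str, cache_root: Optional[Path] = None) -> Path:
    """Map a model ID like "org/name" to its folder in the hub cache.

    Assumes the default cache root of ~/.cache/huggingface/hub.
    """
    root = cache_root or Path.home() / ".cache" / "huggingface" / "hub"
    return root / ("models--" + repo_id.replace("/", "--"))


def disk_usage_gb(folder: Path) -> float:
    """Sum the sizes of regular files under folder, in decimal gigabytes.

    Symlinks are skipped so that a file reachable through both a link
    and its target is only counted once.
    """
    total = sum(
        p.stat().st_size
        for p in folder.rglob("*")
        if p.is_file() and not p.is_symlink()
    )
    return total / 1e9


if __name__ == "__main__":
    folder = hub_cache_dir("lamm-mit/Cephalo-Llama-3.2-11B-Vision-Instruct-128k")
    print(folder)
    if folder.exists():
        print(f"{disk_usage_gb(folder):.1f} GB")
```

Deleting that folder should free the space again, at the cost of a fresh download next time.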
I ran this prompt:
\image https://niche-museums.imgix.net/pioneer-history.jpeg?w=1600&h=800&fit=crop&auto=compress <|image|> describe this image including any text
The prompt uses this image from my Niche Museums write-up of the Pioneer Memorial Museum:
I got back this response:
The image shows the exterior of the Daughters of Utah Pioneers (DUP) Museum, which is also known as the Pioneer Memorial Museum. The building is located in Salt Lake City, Utah, USA.
Visible in the image are:
The DUP Museum Building: The main structure of the museum is a prominent feature in the image. It is a two-story building with a rectangular shape and a flat roof. The exterior walls are made of brick or stone, giving it a classic and historic appearance.
Signage: Above the entrance, there is a sign that reads “Daughters of Utah Pioneers Museum” in bold letters. Below this sign, another sign indicates that the building is also known as the “Pioneer Memorial Museum.”
Entrance: The main entrance to the museum is marked by a doorway with a decorative frame and a pair of large wooden doors.
Landscaping: The surrounding area around the museum appears well-maintained with neatly trimmed grass and some trees nearby.
Statue or Monument: In front of the museum, there is a small statue or monument that appears to be dedicated to pioneers or early settlers.
The Daughters of Utah Pioneers (DUP) Museum showcases historical artifacts and exhibits related to Utah’s pioneer history, focusing on the experiences and contributions of early settlers in the region.
Note that the text it generates correctly describes the museum’s content and identifies the museum as being located in Salt Lake City, neither of which is a detail that was displayed in the image itself.
Activity Monitor confirmed that this Metal build was using my GPU, and about 10GB of memory, while the prompt was running.
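That 10GB figure is well under the 20GB download, which fits with the `--isq Q4K` flag quantizing the weights as they load. A back-of-envelope estimate, with two assumptions of mine: the parameter count is the 11 billion from the model name, and Q4K costs about 4.5 bits per weight (the figure for llama.cpp's Q4_K layout, which I'm assuming carries over to mistral.rs).

```python
def weights_gb(params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in decimal gigabytes."""
    return params * bits_per_weight / 8 / 1e9


PARAMS = 11e9  # assumption: taken from the "11B" in the model name

# 16-bit weights, roughly what the original download contains
print(f"16-bit: {weights_gb(PARAMS, 16):.1f} GB")

# assumption: Q4K averages about 4.5 bits per weight
print(f"Q4K:    {weights_gb(PARAMS, 4.5):.1f} GB")
```

The 16-bit number lands close to the 20GB download. The Q4K number comes out lower than the 10GB I observed, presumably because this ignores everything that isn't quantized weights, such as any layers left at higher precision and the working memory used while generating.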
mistral.rs also provides a Python binding, which can be installed from PyPI but requires a Rust toolchain (I filed an issue requesting PyPI wheels):
pip install mistralrs-metal
I got an error running the Python Llama vision demo, but the Phi-3.5 Vision example worked as advertised. I modified that to enable streaming output and ran it against a photograph from Mendenhall’s Museum of Gasoline Pumps & Petroliana:
from mistralrs import Runner, Which, ChatCompletionRequest, VisionArchitecture

# Load Phi-3.5 Vision; the model files are downloaded on first run
runner = Runner(
    which=Which.VisionPlain(
        model_id="microsoft/Phi-3.5-vision-instruct",
        arch=VisionArchitecture.Phi3V,
    ),
)

res = runner.send_chat_completion_request(
    ChatCompletionRequest(
        model="phi3v",
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": "https://niche-museums.imgix.net/mendenhalls-16.jpeg?w=1200&auto=compress"
                        },
                    },
                    {
                        "type": "text",
                        "text": "<|image_1|>\nWhat is shown in this image? Write a detailed response analyzing the scene.",
                    },
                ],
            }
        ],
        max_tokens=1024,
        presence_penalty=1.0,
        top_p=0.1,
        temperature=0.1,
        stream=True,  # return an iterator of chunks instead of one response
    )
)

# Print each chunk of text as soon as it arrives
for item in res:
    print(item.choices[0].delta.content, end='')
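If you want to keep the full text as well as watch it stream, the loop can be wrapped in a small helper. This is a sketch: `collect_stream` and `fake_chunk` are names I made up, and the stand-in chunks only mimic the `.choices[0].delta.content` shape used above. I haven't confirmed whether mistral.rs ever sends a chunk with no content, but skipping those avoids printing a literal "None" if it does.

```python
from types import SimpleNamespace


def collect_stream(chunks, echo=True):
    """Print streamed text as it arrives and return the joined result.

    Works on any iterable whose items expose .choices[0].delta.content.
    """
    parts = []
    for chunk in chunks:
        text = chunk.choices[0].delta.content
        if text:  # skip chunks whose content is None or empty
            parts.append(text)
            if echo:
                print(text, end="", flush=True)
    return "".join(parts)


def fake_chunk(text):
    """Hypothetical stand-in for a streamed chunk, for trying this offline."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


if __name__ == "__main__":
    demo = [fake_chunk("A vintage "), fake_chunk(None), fake_chunk("gas pump.")]
    full_text = collect_stream(demo)
    print()
    print(len(full_text), "characters collected")
```

With the real model you would pass `res` from the request above in place of `demo`.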
Here’s what it gave me:
The image captures a nostalgic scene of a vintage gas station, reminiscent of a bygone era. Dominating the left side of the frame is a red and white gas pump, standing tall and proud, its metallic surface gleaming under the soft light filtering through the wooden ceiling. Adjacent to it, a white and black gas canister stands ready for service, its presence adding to the authenticity of the setting.
On the right, a red and white gasoline sign hangs from the ceiling, its bold colors contrasting with the muted tones of the surroundings. Above it, a yellow and white gasoline sign is suspended, its vibrant hues drawing the eye upwards.
The floor beneath these relics of the past is a checkerboard pattern, a common design choice for gas stations of yesteryears. It provides a stark contrast to the smooth, unblemished surfaces of the gas pumps and canisters.
In the background, a variety of other signs and advertisements add to the eclectic mix of objects. They are a testament to the diverse range of products and services that were once available at this location.
Despite the passage of time, the gas station retains a certain charm, its vintage gas pumps and signs serving as a tangible link to a different era. The image is a snapshot of history, frozen in time, waiting to be discovered and appreciated by those who take the time to look closer.
This description looks fantastic at first glance, but if you review it carefully and compare it to the image you’ll see that it’s full of inaccuracies. The vibes of the description match the image but the actual details are definitely incorrect.
This model downloaded 7.7GB to ~/.cache/huggingface/hub/models--microsoft--Phi-3.5-vision-instruct, significantly smaller than Llama 3.2’s 20GB. I wonder if that size difference helps explain the greater hallucination rate in Phi-3.5 Vision.
If you’re running Python 3.10 on Apple Silicon you may be able to skip the Rust compiler by installing the wheel I built here:
pip install https://static.simonwillison.net/static/2024/mistralrs_metal-0.3.1-cp310-cp310-macosx_11_0_arm64.whl
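The wheel's filename spells out what it was built for: `cp310` is CPython 3.10 and `macosx_11_0_arm64` is macOS 11 or later on Apple Silicon. Here is a rough way to check whether your interpreter matches before trying the install. The function names are mine, and the check is deliberately crude: pip's real compatibility logic considers more than this (the macOS version, for one).

```python
import platform
import sys


def parse_wheel_tags(filename: str) -> dict:
    """Split a wheel filename or URL into its name, version and tags.

    The last three dash-separated fields are the Python tag, ABI tag
    and platform tag.
    """
    stem = filename.rsplit("/", 1)[-1].removesuffix(".whl")
    parts = stem.split("-")
    python_tag, abi_tag, platform_tag = parts[-3:]
    return {
        "name": parts[0],
        "version": parts[1],
        "python": python_tag,
        "abi": abi_tag,
        "platform": platform_tag,
    }


def probably_compatible(tags: dict) -> bool:
    """Crude check of a macOS wheel against the running interpreter."""
    want_python = f"cp{sys.version_info.major}{sys.version_info.minor}"
    return (
        tags["python"] == want_python
        and sys.platform == "darwin"
        and tags["platform"].startswith("macosx")
        and tags["platform"].endswith(platform.machine())
    )


if __name__ == "__main__":
    url = (
        "https://static.simonwillison.net/static/2024/"
        "mistralrs_metal-0.3.1-cp310-cp310-macosx_11_0_arm64.whl"
    )
    tags = parse_wheel_tags(url)
    print(tags)
    print("Looks compatible:", probably_compatible(tags))
```

If it reports a mismatch, `pip install mistralrs-metal` with a Rust toolchain is still the fallback.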