<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: justine-tunney</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/justine-tunney.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2024-08-19T20:08:59+00:00</updated><author><name>Simon Willison</name></author><entry><title>llamafile v0.8.13 (and whisperfile)</title><link href="https://simonwillison.net/2024/Aug/19/whisperfile/#atom-tag" rel="alternate"/><published>2024-08-19T20:08:59+00:00</published><updated>2024-08-19T20:08:59+00:00</updated><id>https://simonwillison.net/2024/Aug/19/whisperfile/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/Mozilla-Ocho/llamafile/releases/tag/0.8.13"&gt;llamafile v0.8.13 (and whisperfile)&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The latest release of &lt;a href="https://github.com/Mozilla-Ocho/llamafile"&gt;llamafile&lt;/a&gt; (&lt;a href="https://simonwillison.net/2023/Nov/29/llamafile/"&gt;previously&lt;/a&gt;) adds support for &lt;a href="https://blog.google/technology/developers/gemma-open-models/"&gt;Gemma 2B&lt;/a&gt; (pre-bundled &lt;a href="https://huggingface.co/jartine/gemma-2-27b-it-llamafile/tree/main"&gt;llamafiles available here&lt;/a&gt;), significant performance improvements, and new support for the Whisper speech-to-text model, based on &lt;a href="https://github.com/ggerganov/whisper.cpp"&gt;whisper.cpp&lt;/a&gt;, Georgi Gerganov's C++ implementation of Whisper that pre-dates his work on &lt;code&gt;llama.cpp&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;I got &lt;code&gt;whisperfile&lt;/code&gt; working locally by first downloading the cross-platform executable attached to &lt;a href="https://github.com/Mozilla-Ocho/llamafile/releases/tag/0.8.13"&gt;the GitHub release&lt;/a&gt; and then grabbing a &lt;code&gt;whisper-tiny.en-q5_1.bin&lt;/code&gt; model from Hugging Face:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;wget -O whisper-tiny.en-q5_1.bin \
  https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-tiny.en-q5_1.bin
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then I ran &lt;code&gt;chmod 755 whisperfile-0.8.13&lt;/code&gt; to make it executable and ran it against an example &lt;code&gt;.wav&lt;/code&gt; file like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;./whisperfile-0.8.13 -m whisper-tiny.en-q5_1.bin -f raven_poe_64kb.wav --no-prints
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;--no-prints&lt;/code&gt; option suppresses the debug output, so you just get text that looks like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;[00:00:00.000 --&amp;gt; 00:00:12.000]   This is a LibraVox recording. All LibraVox recordings are in the public domain. For more information please visit LibraVox.org.
[00:00:12.000 --&amp;gt; 00:00:20.000]   Today's reading The Raven by Edgar Allan Poe, read by Chris Scurringe.
[00:00:20.000 --&amp;gt; 00:00:40.000]   Once upon a midnight dreary, while I pondered weak and weary, over many a quaint and curious volume of forgotten lore. While I nodded nearly napping, suddenly there came a tapping as of someone gently rapping, rapping at my chamber door.
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;There are quite a few &lt;a href="https://github.com/Mozilla-Ocho/llamafile/issues/544#issuecomment-2297368432"&gt;undocumented options&lt;/a&gt; - this one writes out JSON to a file called &lt;code&gt;transcript.json&lt;/code&gt; (&lt;a href="https://gist.github.com/simonw/39173ac94e71cb01b749f9256a9408c4"&gt;example output&lt;/a&gt;):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;./whisperfile-0.8.13 -m whisper-tiny.en-q5_1.bin -f /tmp/raven_poe_64kb.wav --no-prints --output-json --output-file transcript
&lt;/code&gt;&lt;/pre&gt;
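&lt;p&gt;If you have &lt;code&gt;jq&lt;/code&gt; installed you can reduce that JSON back to plain text. The &lt;code&gt;transcription&lt;/code&gt; array and &lt;code&gt;text&lt;/code&gt; field names here come from my example output linked above - treat them as assumptions if you are running a different whisper.cpp version:&lt;/p&gt;

```shell
# Print just the text of each segment from transcript.json.
# Field names ("transcription", "text") are taken from the example
# output linked above - they may differ in other versions.
jq -r '.transcription[].text' transcript.json
```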
&lt;p&gt;I had to convert my own audio recordings to 16kHz &lt;code&gt;.wav&lt;/code&gt; files in order to use them with &lt;code&gt;whisperfile&lt;/code&gt;. I used &lt;code&gt;ffmpeg&lt;/code&gt; to do this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;ffmpeg -i runthrough-26-oct-2023.wav -ar 16000 /tmp/out.wav
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then I could transcribe that like so:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;./whisperfile-0.8.13 -m whisper-tiny.en-q5_1.bin -f /tmp/out.wav --no-prints
&lt;/code&gt;&lt;/pre&gt;
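&lt;p&gt;If you have a whole directory of recordings, the two steps above can be combined in a small loop - this is just a sketch, and the &lt;code&gt;recordings/&lt;/code&gt; directory and filenames are hypothetical:&lt;/p&gt;

```shell
# Convert each recording to a 16kHz WAV, transcribe it, and save the
# transcript next to the converted file. Paths here are examples.
for f in recordings/*.wav; do
  out="/tmp/$(basename "${f%.wav}")-16k.wav"
  ffmpeg -i "$f" -ar 16000 "$out"
  ./whisperfile-0.8.13 -m whisper-tiny.en-q5_1.bin -f "$out" \
    --no-prints > "${out%.wav}.txt"
done
```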
&lt;p&gt;&lt;strong&gt;Update&lt;/strong&gt;: &lt;a href="https://twitter.com/JustineTunney/status/1825676741593149949"&gt;Justine says&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I've just uploaded new whisperfiles &lt;a href="https://huggingface.co/Mozilla/whisperfile"&gt;to Hugging Face&lt;/a&gt; which use miniaudio.h to automatically resample and convert your mp3/ogg/flac/wav files to the appropriate format.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;With that &lt;code&gt;whisper-tiny&lt;/code&gt; model this took just 11s to transcribe a 10m41s audio file!&lt;/p&gt;
&lt;p&gt;I also tried the much larger Whisper Medium model - I chose to use the 539MB  &lt;code&gt;ggml-medium-q5_0.bin&lt;/code&gt; quantized version of that from &lt;a href="https://huggingface.co/ggerganov/whisper.cpp/tree/main"&gt;huggingface.co/ggerganov/whisper.cpp&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;./whisperfile-0.8.13 -m ggml-medium-q5_0.bin -f out.wav --no-prints
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This time it took 1m49s, using 761% of CPU according to Activity Monitor.&lt;/p&gt;
&lt;p&gt;I tried adding &lt;code&gt;--gpu auto&lt;/code&gt; to exercise the GPU on my M2 Max MacBook Pro:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;./whisperfile-0.8.13 -m ggml-medium-q5_0.bin -f out.wav --no-prints --gpu auto
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That used just 16.9% of CPU and 93% of GPU according to Activity Monitor, and finished in 1m08s. &lt;/p&gt;
&lt;p&gt;I tried this with the &lt;code&gt;tiny&lt;/code&gt; model too but the performance difference there was imperceptible.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/JustineTunney/status/1825551821857010143"&gt;@JustineTunney&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ffmpeg"&gt;ffmpeg&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/whisper"&gt;whisper&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llamafile"&gt;llamafile&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/justine-tunney"&gt;justine-tunney&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/speech-to-text"&gt;speech-to-text&lt;/a&gt;&lt;/p&gt;



</summary><category term="ffmpeg"/><category term="ai"/><category term="whisper"/><category term="local-llms"/><category term="llamafile"/><category term="justine-tunney"/><category term="speech-to-text"/></entry><entry><title>gemma-2-27b-it-llamafile</title><link href="https://simonwillison.net/2024/Jul/2/gemma-2-27b-it-llamafile/#atom-tag" rel="alternate"/><published>2024-07-02T22:38:06+00:00</published><updated>2024-07-02T22:38:06+00:00</updated><id>https://simonwillison.net/2024/Jul/2/gemma-2-27b-it-llamafile/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://huggingface.co/jartine/gemma-2-27b-it-llamafile"&gt;gemma-2-27b-it-llamafile&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Justine Tunney shipped llamafile packages of Google's new openly licensed (though definitely not open source) &lt;a href="https://ai.google.dev/gemma"&gt;Gemma&lt;/a&gt; 2 27b model this morning.&lt;/p&gt;
&lt;p&gt;I downloaded the &lt;code&gt;gemma-2-27b-it.Q5_1.llamafile&lt;/code&gt; version (20.5GB) to my Mac, ran &lt;code&gt;chmod 755 gemma-2-27b-it.Q5_1.llamafile&lt;/code&gt; and then &lt;code&gt;./gemma-2-27b-it.Q5_1.llamafile&lt;/code&gt; and now I'm trying it out through the &lt;code&gt;llama.cpp&lt;/code&gt; default web UI in my browser. It works great.&lt;/p&gt;
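&lt;p&gt;The web UI isn't the only way in: llamafile's built-in server also exposes an OpenAI-compatible endpoint on port 8080, so you can script against the same running model. This is a sketch - the prompt and the &lt;code&gt;model&lt;/code&gt; value are just examples:&lt;/p&gt;

```shell
# Ask the model a question over the OpenAI-compatible API served by
# the running llamafile on port 8080. Prompt is just an example.
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-2-27b-it",
    "messages": [{"role": "user", "content": "Say hello in five words"}]
  }'
```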
&lt;p&gt;It's a &lt;em&gt;very&lt;/em&gt; capable model - currently sitting at position 12 on the &lt;a href="https://chat.lmsys.org/"&gt;LMSYS Arena&lt;/a&gt;, making it the highest-ranked open weights model - one position ahead of Llama-3-70b-Instruct and within striking distance of the GPT-4 class models.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/JustineTunney/status/1808165898743878108"&gt;@JustineTunney&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llamafile"&gt;llamafile&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/justine-tunney"&gt;justine-tunney&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama-cpp"&gt;llama-cpp&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemma"&gt;gemma&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/chatbot-arena"&gt;chatbot-arena&lt;/a&gt;&lt;/p&gt;



</summary><category term="google"/><category term="ai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="llamafile"/><category term="justine-tunney"/><category term="llama-cpp"/><category term="gemma"/><category term="chatbot-arena"/></entry><entry><title>llamafile is the new best way to run an LLM on your own computer</title><link href="https://simonwillison.net/2023/Nov/29/llamafile/#atom-tag" rel="alternate"/><published>2023-11-29T20:54:16+00:00</published><updated>2023-11-29T20:54:16+00:00</updated><id>https://simonwillison.net/2023/Nov/29/llamafile/#atom-tag</id><summary type="html">
    &lt;p&gt;Mozilla’s innovation group and Justine Tunney &lt;a href="https://hacks.mozilla.org/2023/11/introducing-llamafile/"&gt;just released llamafile&lt;/a&gt;, and I think it's now the single best way to get started running Large Language Models (think your own local copy of ChatGPT) on your own computer.&lt;/p&gt;
&lt;p&gt;A llamafile is a single multi-GB file that contains both the model weights for an LLM and the code needed to run that model - in some cases a full local server with a web UI for interacting with it.&lt;/p&gt;
&lt;p&gt;The executable is compiled using &lt;a href="https://justine.lol/cosmopolitan/index.html"&gt;Cosmopolitan Libc&lt;/a&gt;, Justine's incredible project that supports compiling a single binary that works, unmodified, on multiple different operating systems and hardware architectures.&lt;/p&gt;
&lt;p&gt;Here's how to get started with &lt;a href="https://llava-vl.github.io/"&gt;LLaVA 1.5&lt;/a&gt;, a large multimodal model (which means text and image inputs, like GPT-4 Vision) fine-tuned on top of Llama 2. I've tested this process on an M2 Mac, but it should work on other platforms as well (though be sure to &lt;a href="https://github.com/mozilla-Ocho/llamafile#gotchas"&gt;read the Gotchas&lt;/a&gt; section of the README, and take a look at Justine's &lt;a href="https://news.ycombinator.com/item?id=38465916"&gt;list of supported platforms&lt;/a&gt; in a comment on Hacker News).&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Download the 4.29GB &lt;code&gt;llava-v1.5-7b-q4.llamafile&lt;/code&gt; file &lt;a href="https://huggingface.co/jartine/llava-v1.5-7B-GGUF/tree/main"&gt;from Justine's repository on Hugging Face&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;curl -LO https://huggingface.co/jartine/llava-v1.5-7B-GGUF/resolve/main/llava-v1.5-7b-q4.llamafile&lt;/code&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Make that binary executable, by running this in a terminal:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;chmod 755 llava-v1.5-7b-q4.llamafile&lt;/code&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Run your new executable, which will start a web server on port 8080:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;./llava-v1.5-7b-q4.llamafile&lt;/code&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Navigate to &lt;code&gt;http://127.0.0.1:8080/&lt;/code&gt; to start interacting with the model in your browser.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;That's all there is to it. On my M2 Mac it runs at around 55 tokens a second, which is &lt;em&gt;really&lt;/em&gt; fast. And it can analyze images - here's what I got when I uploaded a photograph and asked "Describe this plant":&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2023/llamafile.jpg" alt="Screenshot. llama.cpp - then a photo I took of a plant

User: Describe this plant

Llama: The image features a large, green plant with numerous thin branches and leaves. Among the many stems of this plant, there is an orange flower visible near its center. This beautifully decorated plant stands out in the scene due to its vibrant colors and intricate structure.

18ms per token, 54.24 tokens per second
Powered by llama.cpp, ggml.ai, and llamafile" style="max-width: 100%" /&gt;&lt;/p&gt;
&lt;h4 id="llamafile-how-this-works"&gt;How this works&lt;/h4&gt;
&lt;p&gt;There are a number of different components working together here.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The LLaVA 1.5 model by Haotian Liu, Chunyuan Li, Yuheng Li and Yong Jae Lee is &lt;a href="https://arxiv.org/abs/2310.03744"&gt;described in this paper&lt;/a&gt;, with further details on &lt;a href="https://llava-vl.github.io/"&gt;llava-vl.github.io&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;The models are executed using &lt;a href="https://github.com/ggerganov/llama.cpp"&gt;llama.cpp&lt;/a&gt;, and in the above demo also use the &lt;code&gt;llama.cpp&lt;/code&gt; server example to provide the UI.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://justine.lol/cosmopolitan/index.html"&gt;Cosmopolitan Libc&lt;/a&gt; is the magic that makes one binary work on multiple platforms. I wrote more about that in a TIL a few months ago, &lt;a href="https://til.simonwillison.net/cosmopolitan/ecosystem"&gt;Catching up with the Cosmopolitan ecosystem&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id="llamafile-trying-other-models"&gt;Trying more models&lt;/h4&gt;
&lt;p&gt;The &lt;a href="https://github.com/mozilla-Ocho/llamafile"&gt;llamafile README&lt;/a&gt; currently links to binaries for &lt;code&gt;Mistral-7B-Instruct&lt;/code&gt;, &lt;code&gt;LLaVA 1.5&lt;/code&gt; and &lt;code&gt;WizardCoder-Python-13B&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;You can also download a much smaller &lt;code&gt;llamafile&lt;/code&gt; binary from &lt;a href="https://github.com/Mozilla-Ocho/llamafile/releases/tag/0.1"&gt;their releases&lt;/a&gt;, which can then execute any model that has been compiled to GGUF format:&lt;/p&gt;
&lt;p&gt;I grabbed &lt;code&gt;llamafile-server-0.1&lt;/code&gt; (4.45MB) like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;curl -LO https://github.com/Mozilla-Ocho/llamafile/releases/download/0.1/llamafile-server-0.1
chmod 755 llamafile-server-0.1&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Then ran it against a 13GB &lt;code&gt;llama-2-13b.Q8_0.gguf&lt;/code&gt; file I had &lt;a href="https://huggingface.co/TheBloke/Llama-2-13B-GGUF/tree/main"&gt;previously downloaded&lt;/a&gt;:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;./llamafile-server-0.1 -m llama-2-13b.Q8_0.gguf&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This gave me the same interface at &lt;code&gt;http://127.0.0.1:8080/&lt;/code&gt; (without the image upload) and let me talk with the model at 24 tokens per second.&lt;/p&gt;
&lt;h4 id="llamafile-one-file"&gt;One file is all you need&lt;/h4&gt;
&lt;p&gt;I think my favourite thing about llamafile is what it represents. This is a single binary file which you can download and then use, forever, on (almost) any computer.&lt;/p&gt;

&lt;p&gt;You don't need a network connection, and you don't need to keep track of more than one file.&lt;/p&gt;

&lt;p&gt;Stick that file on a USB stick and stash it in a drawer as insurance against a future apocalypse. You'll never be without a language model ever again.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/mozilla"&gt;mozilla&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/cosmopolitan"&gt;cosmopolitan&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama"&gt;llama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mistral"&gt;mistral&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llamafile"&gt;llamafile&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/justine-tunney"&gt;justine-tunney&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama-cpp"&gt;llama-cpp&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="mozilla"/><category term="ai"/><category term="generative-ai"/><category term="cosmopolitan"/><category term="llama"/><category term="local-llms"/><category term="llms"/><category term="mistral"/><category term="llamafile"/><category term="justine-tunney"/><category term="llama-cpp"/></entry><entry><title>Cosmopolitan: Compiling Python</title><link href="https://simonwillison.net/2022/Jul/26/cosmopolitan-python/#atom-tag" rel="alternate"/><published>2022-07-26T20:43:31+00:00</published><updated>2022-07-26T20:43:31+00:00</updated><id>https://simonwillison.net/2022/Jul/26/cosmopolitan-python/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/jart/cosmopolitan/issues/141"&gt;Cosmopolitan: Compiling Python&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Cosmopolitan is Justine Tunney’s “build-once run-anywhere C library”—part of the αcτµαlly pδrταblε εxεcµταblε effort, which produces wildly clever binary executable files that work on multiple different platforms, and is the secret sauce behind redbean. I hadn’t realized this was happening, but there’s an active project to get Python working in this format, producing a new way of running Python applications as standalone executables - only these ones have the potential to run unmodified on Windows, Linux and macOS.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/redbean"&gt;redbean&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/cosmopolitan"&gt;cosmopolitan&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/justine-tunney"&gt;justine-tunney&lt;/a&gt;&lt;/p&gt;



</summary><category term="python"/><category term="redbean"/><category term="cosmopolitan"/><category term="justine-tunney"/></entry></feed>