Simon Willison’s Weblog

Subscribe
Atom feed for local-llms

118 posts tagged “local-llms”

LLMs that can run on consumer hardware like laptops or mobile phones.

2024

Introducing Apple’s On-Device and Server Foundation Models. Apple Intelligence uses both on-device and in-the-cloud models that were trained from scratch by Apple.

Their on-device model is a 3B model that "outperforms larger models including Phi-3-mini, Mistral-7B, and Gemma-7B", while the larger cloud model is comparable to GPT-3.5.

The language models were trained on unlicensed scraped data - I was hoping they might have managed to avoid that, but sadly not:

We train our foundation models on licensed data, including data selected to enhance specific features, as well as publicly available data collected by our web-crawler, AppleBot.

The most interesting thing here is the way they apply fine-tuning to the local model to specialize it for different tasks. Apple call these "adapters", and they use LoRA for this - a technique first published in 2021. This lets them run multiple on-device models based on a shared foundation, specializing in tasks such as summarization and proof-reading.

Here's the section of the Platforms State of the Union talk that talks about the foundation models and their fine-tuned variants.

As Hamel Husain says:

This talk from Apple is the best ad for fine tuning that probably exists.

The video also describes their approach to quantization:

The next step we took is compressing the model. We leveraged state-of-the-art quantization techniques to take a 16-bit per parameter model down to an average of less than 4 bits per parameter to fit on Apple Intelligence-supported devices, all while maintaining model quality.

Still no news on how their on-device image model was trained. I'd love to find out it was trained exclusively using licensed imagery - Apple struck a deal with Shutterstock a few months ago.

# 11th June 2024, 3:44 pm / apple, llms, ai, generative-ai, fine-tuning, apple-intelligence, local-llms

Ultravox (via) Ultravox is "a multimodal Speech LLM built around a pretrained Whisper and Llama 3 backbone". It's effectively an openly licensed version of half of the GPT-4o model OpenAI demoed (but did not fully release) a few weeks ago: Ultravox is multimodal for audio input, but still relies on a separate text-to-speech engine for audio output.

You can try it out directly in your browser through this page on AI.TOWN - hit the "Call" button to start an in-browser voice conversation with the model.

I found the demo extremely impressive - really low latency and it was fun and engaging to talk to. Try saying "pretend to be a wise and sarcastic old fox" to kick it into a different personality.

The GitHub repo includes code for both training and inference, and the full model is available from Hugging Face - about 30GB of .safetensors files.

Ultravox says it's licensed under MIT, but I would expect it to also have to inherit aspects of the Llama 3 license since it uses that as a base model.

# 10th June 2024, 5:34 am / generative-ai, llama, text-to-speech, ai, local-llms, llms

PaliGemma model README (via) One of the more over-looked announcements from Google I/O yesterday was PaliGemma, an openly licensed VLM (Vision Language Model) in the Gemma family of models.

The model accepts an image and a text prompt. It outputs text, but that text can include special tokens representing regions on the image. This means it can return both bounding boxes and fuzzier segment outlines of detected objects, behavior that can be triggered using a prompt such as "segment puffins".

From the README:

PaliGemma uses the Gemma tokenizer with 256,000 tokens, but we further extend its vocabulary with 1024 entries that represent coordinates in normalized image-space (<loc0000>...<loc1023>), and another with 128 entries (<seg000>...<seg127>) that are codewords used by a lightweight referring-expression segmentation vector-quantized variational auto-encoder (VQ-VAE) [...]

You can try it out on Hugging Face.

It's a 3B model, making it feasible to run on consumer hardware.

# 15th May 2024, 9:16 pm / google, generative-ai, google-io, ai, local-llms, llms, gemma, vision-llms, image-segmentation

experimental-phi3-webgpu (via) Run Microsoft’s excellent Phi-3 model directly in your browser, using WebGPU so didn’t work in Firefox for me, just in Chrome.

It fetches around 2.1GB of data into the browser cache on first run, but then gave me decent quality responses to my prompts running at an impressive 21 tokens a second (M2, 64GB).

I think Phi-3 is the highest quality model of this size, so it’s a really good fit for running in a browser like this.

# 9th May 2024, 10:21 pm / browsers, webassembly, generative-ai, ai, local-llms, llms, phi, webgpu

microsoft/Phi-3-mini-4k-instruct-gguf (via) Microsoft’s Phi-3 LLM is out and it’s really impressive. This 4,000 token context GGUF model is just a 2.2GB (for the Q4 version) and ran on my Mac using the llamafile option described in the README. I could then run prompts through it using the llm-llamafile plugin.

The vibes are good! Initial test prompts I’ve tried feel similar to much larger 7B models, despite using just a few GBs of RAM. Tokens are returned fast too—it feels like the fastest model I’ve tried yet.

And it’s MIT licensed.

# 23rd April 2024, 5:40 pm / llms, llm, generative-ai, ai, local-llms, microsoft, phi

We introduce phi-3-mini, a 3.8 billion parameter language model trained on 3.3 trillion tokens, whose overall performance, as measured by both academic benchmarks and internal testing, rivals that of models such as Mixtral 8x7B and GPT-3.5 (e.g., phi-3-mini achieves 69% on MMLU and 8.38 on MT-bench), despite being small enough to be deployed on a phone.

Phi-3 Technical Report

# 23rd April 2024, 3 am / generative-ai, microsoft, ai, local-llms, llms

Options for accessing Llama 3 from the terminal using LLM

Visit Options for accessing Llama 3 from the terminal using LLM

Llama 3 was released on Thursday. Early indications are that it’s now the best available openly licensed model—Llama 3 70b Instruct has taken joint 5th place on the LMSYS arena leaderboard, behind only Claude 3 Opus and some GPT-4s and sharing 5th place with Gemini Pro and Claude 3 Sonnet. But unlike those other models Llama 3 70b is weights available and can even be run on a (high end) laptop!

[... 1,962 words]

llm-gpt4all. New release of my LLM plugin which builds on Nomic's excellent gpt4all Python library. I've upgraded to their latest version which adds support for Llama 3 8B Instruct, so after a 4.4GB model download this works:

llm -m Meta-Llama-3-8B-Instruct "say hi in Spanish"

# 20th April 2024, 5:58 pm / nomic, llm, plugins, projects, generative-ai, ai, llms, llama, local-llms

Mistral tweet a magnet link for mixtral-8x22b. Another open model release from Mistral using their now standard operating procedure of tweeting out a raw torrent link.

This one is an 8x22B Mixture of Experts model. Their previous most powerful openly licensed release was Mixtral 8x7B, so this one is a whole lot bigger (a 281GB download)—and apparently has a 65,536 context length, at least according to initial rumors on Twitter.

# 10th April 2024, 2:31 am / mistral, generative-ai, ai, local-llms, llms, llm-release

Gemma: Introducing new state-of-the-art open models. Google get in on the openly licensed LLM game: Gemma comes in two sizes, 2B and 7B, trained on 2 trillion and 6 trillion tokens respectively. The terms of use “permit responsible commercial usage”. In the benchmarks it appears to compare favorably to Mistral and Llama 2.

Something that caught my eye in the terms: “Google may update Gemma from time to time, and you must make reasonable efforts to use the latest version of Gemma.”

One of the biggest benefits of running your own model is that it can protect you from model updates that break your carefully tested prompts, so I’m not thrilled by that particular clause.

UPDATE: It turns out that clause isn’t uncommon—the phrase “You shall undertake reasonable efforts to use the latest version of the Model” is present in both the Stable Diffusion and BigScience Open RAIL-M licenses.

# 21st February 2024, 4:22 pm / google, generative-ai, ai, local-llms, llms, gemma

Mixtral of Experts. The Mixtral paper is out, exactly a month after the release of the Mixtral 8x7B model itself. Thanks to the paper I now have a reasonable understanding of how a mixture of experts model works: each layer has 8 available blocks, but a router model selects two out of those eight for each token passing through that layer and combines their output. “As a result, each token has access to 47B parameters, but only uses 13B active parameters during inference.”

The Mixtral token context size is an impressive 32k, and it compares extremely well against the much larger Llama 70B across a whole array of benchmarks.

Unsurprising but disappointing: there’s nothing in the paper at all about what it was trained on.

# 9th January 2024, 4:03 am / mistral, llms, ai, generative-ai, local-llms

2023

Many options for running Mistral models in your terminal using LLM

Visit Many options for running Mistral models in your terminal using LLM

Mistral AI is the most exciting AI research lab at the moment. They’ve now released two extremely powerful smaller Large Language Models under an Apache 2 license, and have a third much larger one that’s available via their API.

[... 2,063 words]

The AI trust crisis

Visit The AI trust crisis

Dropbox added some new AI features. In the past couple of days these have attracted a firestorm of criticism. Benj Edwards rounds it up in Dropbox spooks users with new AI features that send data to OpenAI when used.

[... 1,733 words]

Mixtral of experts (via) Mistral have firmly established themselves as the most exciting AI lab outside of OpenAI, arguably more exciting because much of their work is released under open licenses.

On December 8th they tweeted a link to a torrent, with no additional context (a neat marketing trick they’ve used in the past). The 87GB torrent contained a new model, Mixtral-8x7b-32kseqlen—a Mixture of Experts.

Three days later they published a full write-up, describing “Mixtral 8x7B, a high-quality sparse mixture of experts model (SMoE) with open weights”—licensed Apache 2.0.

They claim “Mixtral outperforms Llama 2 70B on most benchmarks with 6x faster inference”—and that it outperforms GPT-3.5 on most benchmarks too.

This isn’t even their current best model. The new Mistral API platform (currently on a waitlist) refers to Mixtral as “Mistral-small” (and their previous 7B model as “Mistral-tiny”—and also provides access to a currently closed model, “Mistral-medium”, which they claim to be competitive with GPT-4.

# 11th December 2023, 5:20 pm / mistral, generative-ai, gpt-4, ai, llms, llm-release, local-llms

llamafile is the new best way to run an LLM on your own computer

Visit llamafile is the new best way to run an LLM on your own computer

Mozilla’s innovation group and Justine Tunney just released llamafile, and I think it’s now the single best way to get started running Large Language Models (think your own local copy of ChatGPT) on your own computer.

[... 650 words]

Build an image search engine with llm-clip, chat with models with llm chat

Visit Build an image search engine with llm-clip, chat with models with llm chat

LLM is my combination CLI tool and Python library for working with Large Language Models. I just released LLM 0.10 with two significant new features: embedding support for binary files and the llm chat command.

[... 1,188 words]

Running my own LLM (via) Nelson Minar describes running LLMs on his own computer using my LLM tool and llm-gpt4all plugin, plus some notes on trying out some of the other plugins.

# 16th August 2023, 10:42 pm / llm, nelson-minar, local-llms, llms

Run Llama 2 on your own Mac using LLM and Homebrew

Llama 2 is the latest commercially usable openly licensed Large Language Model, released by Meta AI a few weeks ago. I just released a new plugin for my LLM utility that adds support for Llama 2 and many other llama-cpp compatible models.

[... 1,423 words]

Llama 2: The New Open LLM SOTA. I’m in this Latent Space podcast, recorded yesterday, talking about the Llama 2 release.

# 19th July 2023, 5:37 pm / generative-ai, llama, ai, local-llms, podcasts

llama2-mac-gpu.sh (via) Adrien Brault provided this recipe for compiling llama.cpp on macOS with GPU support enabled (“LLAMA_METAL=1 make”) and then downloading and running a GGML build of Llama 2 13B.

# 19th July 2023, 4:04 am / macos, generative-ai, llama, ai, local-llms, llms, llama-cpp

Ollama (via) This tool for running LLMs on your own laptop directly includes an installer for macOS (Apple Silicon) and provides a terminal chat interface for interacting with models. They already have Llama 2 support working, with a model that downloads directly from their own registry service without need to register for an account or work your way through a waiting list.

# 18th July 2023, 9 pm / generative-ai, llama, ai, local-llms, llms, ollama

Accessing Llama 2 from the command-line with the llm-replicate plugin

Visit Accessing Llama 2 from the command-line with the llm-replicate plugin

The big news today is Llama 2, the new openly licensed Large Language Model from Meta AI. It’s a really big deal:

[... 1,206 words]

Weeknotes: Self-hosted language models with LLM plugins, a new Datasette tutorial, a dozen package releases, a dozen TILs

A lot of stuff to cover from the past two and a half weeks.

[... 1,742 words]

My LLM CLI tool now supports self-hosted language models via plugins

Visit My LLM CLI tool now supports self-hosted language models via plugins

LLM is my command-line utility and Python library for working with large language models such as GPT-4. I just released version 0.5 with a huge new feature: you can now install plugins that add support for additional models to the tool, including models that can run on your own hardware.

[... 1,656 words]

Databricks Signs Definitive Agreement to Acquire MosaicML, a Leading Generative AI Platform. MosaicML are the team behind MPT-7B and MPT-30B, two of the most impressive openly licensed LLMs. They just got acquired by Databricks for $1.3 billion dollars.

# 30th June 2023, 1:43 am / open-source, generative-ai, ai, local-llms, llms

abacaj/mpt-30B-inference. MPT-30B, released last week, is an extremely capable Apache 2 licensed open source language model. This repo shows how it can be run on a CPU, using the ctransformers Python library based on GGML. Following the instructions in the README got me a working MPT-30B model on my M2 MacBook Pro. The model is a 19GB download and it takes a few seconds to start spitting out tokens, but it works as advertised.

# 29th June 2023, 3:27 am / llms, ai, local-llms, generative-ai, open-source, llm-release

MLC: Bringing Open Large Language Models to Consumer Devices (via) “We bring RedPajama, a permissive open language model to WebGPU, iOS, GPUs, and various other platforms.” I managed to get this running on my Mac (see via link) with a few tweaks to their official instructions.

# 22nd May 2023, 7:25 pm / generative-ai, mlc, redpajama, ai, llms, local-llms, webgpu, gpus

LocalAI (via) “Self-hosted, community-driven, local OpenAI-compatible API”. Designed to let you run local models such as those enabled by llama.cpp without rewriting your existing code that calls the OpenAI REST APIs. Reminds me of the various S3-compatible storage APIs that exist today.

# 14th May 2023, 1:05 pm / llms, ai, local-llms, generative-ai

Introducing MPT-7B: A New Standard for Open-Source, Commercially Usable LLMs (via) There’s a lot to absorb about this one. Mosaic trained this model from scratch on 1 trillion tokens, at a cost of $200,000 taking 9.5 days. It’s Apache-2.0 licensed and the model weights are available today.

They’re accompanying the base model with an instruction-tuned model called MPT-7B-Instruct (licensed for commercial use) and a non-commercially licensed MPT-7B-Chat trained using OpenAI data. They also announced MPT-7B-StoryWriter-65k+—“a model designed to read and write stories with super long context lengths”—with a previously unheard of 65,000 token context length.

They’re releasing these models mainly to demonstrate how inexpensive and powerful their custom model training service is. It’s a very convincing demo!

# 5th May 2023, 7:05 pm / open-source, generative-ai, ai, local-llms, llms

No Moat: Closed AI gets its Open Source wakeup call — ft. Simon Willison (via) I joined the Latent Space podcast yesterday (on short notice, so I was out and about on my phone) to talk about the leaked Google memo about open source LLMs. This was a Twitter Space, but swyx did an excellent job of cleaning up the audio and turning it into a podcast.

# 5th May 2023, 6:17 pm / local-llms, generative-ai, ai, speaking, llms, podcasts, podcast-appearances