Everything I've learned so far about running local LLMs (via) Chris Wellons shares detailed notes on his experience running local LLMs on Windows - though most of these tips apply to other operating systems as well.
This is great, there's a ton of detail here and the root recommendations are very solid: Use llama-server
from llama.cpp and try ~8B models first (Chris likes Llama 3.1 8B Instruct at Q4_K_M as a first model), anything over 10B probably won't run well on a CPU so you'll need to consider your available GPU VRAM.
This is neat:
Just for fun, I ported llama.cpp to Windows XP and ran a 360M model on a 2008-era laptop. It was magical to load that old laptop with technology that, at the time it was new, would have been worth billions of dollars.
I need to spend more time with Chris's favourite models, Mistral-Nemo-2407 (12B) and Qwen2.5-14B/72B.
Chris also built illume, a Go CLI tool for interacting with models that looks similar to my own LLM project.
Recent articles
- Live blog: the 12th day of OpenAI - "Early evals for OpenAI o3" - 20th December 2024
- December in LLMs has been a lot - 20th December 2024
- Gemini 2.0 Flash "Thinking mode" - 19th December 2024