A practical guide to deploying Large Language Models Cheap, Good *and* Fast. Joel Kang’s extremely comprehensive notes on what he learned trying to run Vicuna-13B-v1.5 on an affordable cloud GPU server (a T4 at $0.615/hour). The space is in so much flux right now—Joel ended up using MLC but the best option could change any minute.
Vicuna 13B quantized to 4-bit integers needed 7.5GB of the T4’s 16GB of VRAM, and returned tokens at 20/second.
An open challenge running MLC right now is around batching and concurrency: “I did try making 3 concurrent requests to the endpoint, and while they all stream tokens back and the server doesn’t OOM, the output of all 3 streams seem to actually belong to a single prompt.”
Recent articles
- Weeknotes: Embeddings, more embeddings and Datasette Cloud - 17th September 2023
- Build an image search engine with llm-clip, chat with models with llm chat - 12th September 2023
- LLM now provides tools for working with embeddings - 4th September 2023
- Datasette 1.0a4 and 1.0a5, plus weeknotes - 30th August 2023
- Making Large Language Models work for you - 27th August 2023
- Datasette Cloud, Datasette 1.0a3, llm-mlc and more - 16th August 2023