Using pip to install a Large Language Model that’s under 100MB

7th February 2025

I just released llm-smollm2, a new plugin for LLM that bundles a quantized copy of the SmolLM2-135M-Instruct LLM inside of the Python package.

This means you can now pip install a full LLM!

If you’re already using LLM you can install it like this:

llm install llm-smollm2

Then run prompts like this:

llm -m SmolLM2 'Are dogs real?'

(New favourite test prompt for tiny models, courtesy of Tim Duffy. Here’s the result).

If you don’t have LLM yet, first follow these installation instructions, or run brew install llm, pipx install llm or uv tool install llm, depending on your preferred way of installing Python tools.

If you have uv set up you don’t need to install anything at all! The following command will spin up an ephemeral environment, install the necessary packages and start a chat session with the model, all in one go:

uvx --with llm-smollm2 llm chat -m SmolLM2

$ uvx --with llm-smollm2 llm chat -m SmolLM2
Chatting with SmolLM2
Type 'exit' or 'quit' to exit
Type '!multi' to enter multiple lines, then '!end' to finish
> tell me a long story about a dog
The story of a dog is a tale of love, loss, and the unspoken rules of the universe. It begins with a small, timid, and timid dog named Max. Max was born into a loving family, but his life was forever changed by a cruel and cruel owner. He was left to fend for himself, and he was left to fend for himself.

Max's owner, a kind and gentle soul named Elena, was a cruel and cruel one. She had no patience, no love, and she would do anything to keep Max away from her. Max was left to fend for himself, and he was left to fend for himself.

Max's owner, Elena, was determined to make Max happy. She took Max to a beautiful and beautiful place, where Max was allowed to roam freely. Max was allowed to roam freely, and Max was allowed to roam freely. [Then repeats that sentence many times]

Finding a tiny model

The fact that the model is almost exactly 100MB is no coincidence: that’s the default size limit for a Python package that can be uploaded to the Python Package Index (PyPI).

I asked on Bluesky if anyone had seen a just-about-usable GGUF model that was under 100MB, and Artisan Loaf pointed me to SmolLM2-135M-Instruct.

I ended up using this quantization by QuantFactory just because it was the first sub-100MB model I tried that worked.

Trick for finding quantized models: Hugging Face has a neat “model tree” feature in the side panel of their model pages, which includes links to relevant quantized models. I find most of my GGUFs using that feature.

Model tree for HuggingFaceTB/SmolLM2-135M-Instruct. 60 Quantizations, 6 adapters, 80 finetunes, 1 merge.
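
If you prefer to fetch a GGUF file from a script rather than clicking through the Hugging Face UI, the huggingface_hub library can download one directly. Here’s a minimal sketch; the repo ID is my guess at where the QuantFactory quantization lives, so check it against the model tree before relying on it:

from huggingface_hub import hf_hub_download

# Download the quantized GGUF file to the local Hugging Face cache.
# The repo_id below is an assumption - confirm it via the model tree.
model_path = hf_hub_download(
    repo_id="QuantFactory/SmolLM2-135M-Instruct-GGUF",
    filename="SmolLM2-135M-Instruct.Q4_1.gguf",
)
print(model_path)  # local path to the cached .gguf file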

Building the plugin

I first tried the model out using Python and the llama-cpp-python library like this:

uv run --with llama-cpp-python python

Then:

from llama_cpp import Llama
from pprint import pprint
llm = Llama(model_path="SmolLM2-135M-Instruct.Q4_1.gguf")
output = llm.create_chat_completion(messages=[
    {"role": "user", "content": "Hi"}
])
pprint(output)

This gave me the output I was expecting:

{'choices': [{'finish_reason': 'stop',
              'index': 0,
              'logprobs': None,
              'message': {'content': 'Hello! How can I assist you today?',
                          'role': 'assistant'}}],
 'created': 1738903256,
 'id': 'chatcmpl-76ea1733-cc2f-46d4-9939-90efa2a05e7c',
 'model': 'SmolLM2-135M-Instruct.Q4_1.gguf',
 'object': 'chat.completion',
 'usage': {'completion_tokens': 9, 'prompt_tokens': 31, 'total_tokens': 40}}

But it also spammed my terminal with a huge volume of debugging output—which started like this:

llama_model_load_from_file_impl: using device Metal (Apple M2 Max) - 49151 MiB free
llama_model_loader: loaded meta data with 33 key-value pairs and 272 tensors from SmolLM2-135M-Instruct.Q4_1.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama

And then continued for more than 500 lines!

I’ve had this problem with llama-cpp-python and llama.cpp in the past, and was sad to find that the documentation still doesn’t have a great answer for how to avoid this.

So I turned to the just-released Gemini 2.0 Pro (Experimental), because I know it’s a strong model with a long input limit.

I ran the entire llama-cpp-python codebase through it like this:

cd /tmp
git clone https://github.com/abetlen/llama-cpp-python
cd llama-cpp-python
files-to-prompt -e py . -c | llm -m gemini-2.0-pro-exp-02-05 \
  'How can I prevent this library from logging any information at all while it is running - no stderr or anything like that'

Here’s the answer I got back. It recommended setting the logger to logging.CRITICAL, passing verbose=False to the constructor and, most importantly, using the following context manager to suppress all output:

import os
from contextlib import contextmanager, redirect_stderr, redirect_stdout

@contextmanager
def suppress_output():
    """
    Suppresses all stdout and stderr output within the context.
    """
    with open(os.devnull, "w") as devnull:
        with redirect_stdout(devnull), redirect_stderr(devnull):
            yield

This worked! It turned out most of the output came from initializing the Llama class, so I wrapped that like so:

with suppress_output():
    model = Llama(model_path=self.model_path, verbose=False)

Proof of concept in hand, I set about writing the plugin. I started with my simonw/llm-plugin cookiecutter template:

uvx cookiecutter gh:simonw/llm-plugin
  [1/6] plugin_name (): smollm2
  [2/6] description (): SmolLM2-135M-Instruct.Q4_1 for LLM
  [3/6] hyphenated (smollm2): 
  [4/6] underscored (smollm2): 
  [5/6] github_username (): simonw
  [6/6] author_name (): Simon Willison

The rest of the plugin was mostly borrowed from my existing llm-gguf plugin, updated based on the latest README for the llama-cpp-python project.
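
For a sense of what that plugin code looks like, here is a stripped-down sketch of an LLM plugin that registers a llama-cpp-python backed model. It’s an illustration based on LLM’s documented register_models hook and Model.execute() interface, not the actual llm-smollm2 source, and it assumes llama-cpp-python’s OpenAI-compatible streaming chunk format:

# Illustrative sketch of an LLM plugin wrapping llama-cpp-python.
# Not the actual llm-smollm2 source - the structure follows LLM's plugin docs.
import os
from contextlib import redirect_stderr, redirect_stdout

import llm
from llama_cpp import Llama

# Assumes the .gguf file sits next to this module inside the package
MODEL_FILE = os.path.join(
    os.path.dirname(__file__), "SmolLM2-135M-Instruct.Q4_1.gguf"
)


@llm.hookimpl
def register_models(register):
    # Makes "llm -m SmolLM2 ..." resolve to the class below
    register(SmolLM2())


class SmolLM2(llm.Model):
    model_id = "SmolLM2"

    def execute(self, prompt, stream, response, conversation):
        # Silence llama.cpp's noisy startup logging, as described above
        with open(os.devnull, "w") as devnull:
            with redirect_stdout(devnull), redirect_stderr(devnull):
                model = Llama(model_path=MODEL_FILE, verbose=False)
        completion = model.create_chat_completion(
            messages=[{"role": "user", "content": prompt.prompt}],
            stream=True,
        )
        for chunk in completion:
            delta = chunk["choices"][0]["delta"]
            if "content" in delta:
                yield delta["content"]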

There’s more information on building plugins in the tutorial on writing a plugin.

Packaging the plugin

Once I had that working the last step was to figure out how to package it for PyPI. I’m never quite sure of the best way to bundle a binary file in a Python package, especially one that uses a pyproject.toml file... so I dumped a copy of my existing pyproject.toml file into o3-mini-high and prompted:

Modify this to bundle a SmolLM2-135M-Instruct.Q4_1.gguf file inside the package. I don’t want to use hatch or a manifest or anything, I just want to use setuptools.

Here’s the shared transcript—it gave me exactly what I wanted. I bundled it by adding this to the end of the toml file:

[tool.setuptools.package-data]
llm_smollm2 = ["SmolLM2-135M-Instruct.Q4_1.gguf"]

Then I dropped that .gguf file into the llm_smollm2/ directory and put my plugin code in llm_smollm2/__init__.py.
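
Once the .gguf file ships as package data, the plugin needs to locate it at runtime. As an alternative to the __file__-relative path in the sketch earlier, importlib.resources can resolve files inside an installed package; this is one common option, not necessarily what llm-smollm2 does:

from importlib.resources import files

from llama_cpp import Llama

# Resolve the GGUF file bundled inside the installed llm_smollm2 package.
# str(files(...)) gives a real filesystem path for normal wheel installs.
MODEL_PATH = str(files("llm_smollm2") / "SmolLM2-135M-Instruct.Q4_1.gguf")

model = Llama(model_path=MODEL_PATH, verbose=False)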

I tested it locally by running this:

python -m pip install build
python -m build

I fired up a fresh virtual environment and ran pip install ../path/to/llm-smollm2/dist/llm_smollm2-0.1-py3-none-any.whl to confirm that the package worked as expected.

Publishing to PyPI

My cookiecutter template comes with a GitHub Actions workflow that publishes the package to PyPI when a new release is created using the GitHub web interface. Here’s the relevant YAML:

  deploy:
    runs-on: ubuntu-latest
    needs: [test]
    environment: release
    permissions:
      id-token: write
    steps:
    - uses: actions/checkout@v4
    - name: Set up Python
      uses: actions/setup-python@v5
      with:
        python-version: "3.13"
        cache: pip
        cache-dependency-path: pyproject.toml
    - name: Install dependencies
      run: |
        pip install setuptools wheel build
    - name: Build
      run: |
        python -m build
    - name: Publish
      uses: pypa/gh-action-pypi-publish@release/v1

This runs after the test job has passed. It uses the pypa/gh-action-pypi-publish Action to publish to PyPI—I wrote more about how that works in this TIL.

Is the model any good?

This one really isn’t! It’s not a big surprise, but it turns out 94MB just isn’t enough space for a model that can do anything useful.

It’s super fun to play with, and I continue to maintain that small, weak models are a great way to help build a mental model of how this technology actually works.

That’s not to say SmolLM2 isn’t a fantastic model family. I’m running the smallest, most restricted version here. SmolLM—blazingly fast and remarkably powerful describes the full model family, which comes in 135M, 360M, and 1.7B sizes. The larger versions are a whole lot more capable.

If anyone can figure out something genuinely useful to do with the 94MB version I’d love to hear about it.